1  Overview.

NOTE: The most important deliverable resulting from this award is
the Shadow Networks project. It is described in more detail in Sect. 2.


This award supported extensions to, and novel applications of, an increasingly useful machine
learning methodology known as symbolic regression. The PI of this award was involved in earlier
work that established this approach as a powerful method for discovering previously unknown
relationships within and among arbitrary data sets.¹

Unlike more traditional linear and nonlinear regression methods, which attempt to find
coefficients (a) for equations with terms selected by the investigator, symbolic regression attempts to
make few or no assumptions about the form of the right-hand side of a set of equations. In other
words, the investigator does not need to select linear or nonlinear terms a priori. Symbolic
regression is also preferred over other state-of-the-art machine learning methods such as deep learning
because it is a white-box modeling method: the models it produces may be highly nonlinear
yet are often compact and readable by any mathematically literate member of the domain of
interest.

For example, one result from this award was a model that can successfully predict the time
it takes for information to flow from individual i to individual j ($T_{ij}$) as a function of structural
properties of the social network in which those individuals are embedded. One such property is
$L_{ij}$, the length of the shortest path in the network from i to j. Trained against social network
‘chatter’ (people tweeting and retweeting information), symbolic regression constructed this model

N  —  k2  —  k2 

T%j  =  Lij{  1  +  In  +  ki  +  kj  —  Cj  +  N  — ))  (1) 

J^ij  +  Kitij  —  p 

from which even a casual observer can see that it takes longer for information to flow from
individual i to j if they are more distant from one another in the social network. However, the additional
mathematical structure in this model indicates that there are subtler relationships between
individuals’ locations in a social network and how long it takes for information to flow between them.

Symbolic  regression  is  useful  in  that  it  can  often  find  relationships  within  a  data  set  that  are 
missed  by  other  regression  methods  because,  in  the  latter  case,  the  investigator  may  make  the 
wrong  assumptions  about  what  kinds  of  relationships  may  exist  and  thus  include  inappropriate 
terms. Symbolic regression avoids this by allowing the investigator to make few or no assumptions
about what relationships may exist in a data set.

Over the course of this award, four projects were pursued. The first three involved applying
symbolic regression to novel domains: social networks, brain imaging, and satellite imagery.
The fourth project involved theoretical work to improve symbolic regression itself. These four
projects are summarized as follows:

1. The Shadow Networks project. We have successfully adapted symbolic regression for
addressing the node prediction problem in social network data: if a person, along with
all their messages, is deliberately ‘scrubbed’ from chatter collected from a social network,
how could one identify not only that such tampering had occurred, but also where that missing
node may lie in the network (i.e., identify the friends of the erased person)? More details are
provided in Sect. 2.

¹ Bongard J. and Lipson H. (2007). Automated reverse engineering of nonlinear dynamical systems. Proceedings
of the National Academy of Sciences. 104(24): 9943-9948.

2. Symbolically regressing brain imaging data. Using functional Magnetic Resonance Imaging
(fMRI) data collected prior to this award, we have shown that applying symbolic
regression can find heretofore unknown relationships between brain regions, and that those
relationships can be used to predict behavioral tendencies of the participants. As an example,
we found that a model produced by symbolic regression could be used to predict whether or not
an adolescent regularly consumed alcohol. The model made successful predictions by
finding previously unknown relationships between regions of the brain implicated in reward,
emotion, and thirst. More details are provided in Sect. 3.

3. Symbolically regressing satellite imagery. In addition to large-scale data sets produced by
medical scans such as fMRI, environmental modeling from satellite data is another promising
domain for symbolic regression. This award has enabled us to demonstrate a modeling
approach that intelligently balances requests for data for modeling against the differing costs
of data produced by less- or more-expensive sensors [9]. We have also adapted symbolic
regression for use with actual satellite data for predicting the amount of water contained in
the snows of the Hindu Kush [3][6]. More details are provided in Sect. 4.

4. Improving symbolic regression. Much theoretical work has been accomplished as a result
of this award in order to improve symbolic regression. One pair of publications demonstrated
that symbolic regression can be hybridized with gradient-descent regression methods
to produce a combined algorithm that outperforms either approach working alone.
The heart of symbolic regression relies on stochastic modifications to existing models to
sometimes discover more accurate new models. In more recent work we have shown
that these modifications can be de-randomized somewhat to improve symbolic regression’s
ability to discover more accurate and more parsimonious models. More details about these
advances are provided in Sect. 5.

1.1  Human  capital  return  on  investment. 

This  award  supported  two  postdoctoral  associates:  Ilknur  Icke  and  Nicholas  Allgaier. 

Dr.  Icke  is  now  a  senior  engineer  of  scientific  computing  for  Merck.  There,  she  is  developing 
high  throughput  implementations  for  vaccine  development  and  medical  image  registration.  She 
is  also  applying  deep  learning  methods  for  automatically  identifying  the  location  of  cardiac  left 
ventricles  in  medical  scans. 

Dr. Allgaier is now a postdoctoral associate at the University of Vermont Medical Center,
working under the supervision of Hugh Garavan, one of the Principal Investigators of the Adolescent
Brain Cognitive Development (ABCD) Study, the largest long-term study of brain development
and child health in the United States, and the ENIGMA Study, an attempt to construct one of
the largest multi-site, data-pooled, genetic and neuroimaging data sets. Dr. Allgaier is presently
applying some of the methods developed as part of this award to data from both of these studies.
1.2  Other impacts from this award.

•  Symbolic  regression  has  become  a  common  tool  among  the  faculty,  postdoctoral  associates, 
and  graduate  students  who  comprise  the  Vermont  Complex  Systems  Center.  We  graduate 
about  a  half  dozen  graduate  students  and  postdoctoral  associates  a  year  who  go  on  to  take 
up  prominent  positions  in  academia  and  industry.  Most  members  are  involved  in  analyzing 
and  synthesizing  complex  neural,  biological,  technological,  and  social  networks.  Common 
application  domains  involve  social  network  analysis,  the  smart  grid,  cyberinfrastructure,  and 
sociotechnical  systems. 

•  PI  Bongard  has  delivered  a  number  of  presentations  on  work  drawn  from  this  award: 


May, 2016  Trusted autonomous systems. (ACFR, University of Sydney, Australia; Invited)
May, 2016  Trusted autonomous systems. (Intl. Symp. on Trusted Autonomous Systems, Australia; Keynote)
Mar, 2016  Philosophical implications of robotics. (UPitt HPS Annual Lecture Series; Invited)
Feb, 2016  Evo devo robo. (University of Toronto Cognitive Science Symposium; Invited)
Dec, 2015  ShanghAI lecture (simulcast to classrooms in Europe and Asia; Invited)
Dec, 2015  New Jersey Institute of Technology (host: Gal Haspel, biology; Invited)
May, 2015  Factory of Imagination lecture, Denmark (500 attendees; Keynote)
Feb, 2015  ShanghAI lecture (simulcast to classrooms in Europe and Asia; Invited)
Nov, 2014  Cornell University (host: Robert Shepherd, engineering; Invited)
Sept, 2014  University of Maryland workshop on soft robotics (Invited)
Aug, 2014  Scifoo (hosts: Nature, Google, O’Reilly Media, Digital Science; Invited)
July, 2014  Workshop on Artificial Life and the Web at ALife conference (Invited)
July, 2014  International Society for Artificial Life (ISAL) Summer School (Invited)
June, 2014  DARPA Biological Technologies Office (Invited)
June, 2014  Neural Systems & Behavior Summer School, Woods Hole Marine Biology Lab (Invited)
May, 2014  EPFL, Lausanne, Switzerland (host: Auke Ijspeert; Invited)
Mar, 2014  National STEM Conference (Concept Schools), Cleveland, OH (Keynote)
Mar, 2014  Air Force Research Laboratories (AFRL), Rome, NY (Invited)
Dec, 2013  ShanghAI lecture (simulcast to 15 classrooms in Europe and Asia; Invited)
Nov, 2013  National Autonomous University of Mexico (host: Carlos Gershenson; Invited)
Oct, 2013  University of Iowa Delta Center (host: Mark Blumberg, psychology; Invited)
Sept, 2013  eSMC neuroscience/robotics graduate summer school (host: Andreas Engel; Invited)
Sept, 2013  Evolutionary Biology lecture, University of Zurich (host: Andreas Wagner; Invited)
Aug, 2013  Gordon Research Conference on Neuroethology (host: Heather Eisthen, biology; Invited)
July, 2013  Soft Robotics Workshop at ETH, Zurich (host: Fumiya Iida, robotics; Keynote)
June, 2013  Evolution Meeting, SSE Presidential Symposium (host: Richard Lenski, biology; Invited)
June, 2013  Evolution Meeting, Education Symposium (host: George Gilchrist, NSF; Invited)
Mar, 2013  University of Texas at Austin (host: Dana Ballard, Computer Science; Invited)
Nov, 2012  Vassar College (host: John Long, biology; Invited)
Nov, 2012  Harvard University (host: Radhika Nagpal, engineering; Invited)
June, 2012  Tufts University (host: Michael Levin, biology; Invited)
Apr, 2012  Tufts University (host: Barry Trimmer, biology; Invited)
Jan, 2012  University of Southern California (host: Francisco Valero-Cuevas, bioengineering; Invited)

1.3  Software  deliverables. 

•  The source code for two versions of the enhanced symbolic regression method developed
throughout this award is publicly available. Either of these methods can be adapted to novel
data sets by anyone proficient in Python and machine learning methods:

-  The GitHub repository for forward semantic propagation in symbolic regression.

-  The GitHub repository for behavioral diversity in symbolic regression.


2  The  shadow  networks  project. 

The most important product of this award is the ‘Shadow Networks’ method. It is
summarized below, and a manuscript describing its technical details follows.

An  important  problem  in  analyzing  data  generated  by  people  communicating  over  a  social 
network  is  identifying  whether  the  communications  have  been  deliberately  tampered  with.  Such 
challenges  can  be  broken  down  into  two  classes  of  problems:  link  prediction  and  node  prediction. 
In the link prediction problem, it is assumed that either edges have been removed from a social
network (i.e., information about relationships between pairs of individuals has been erased) or
fictitious links have been added (i.e., fictitious relationships have been embedded in the network).

Several  methods  now  exist  for  tackling  the  link  prediction  problem.  However,  before  our  work 
in  this  award,  there  were  no  methods  in  existence  for  tackling  the  much  harder  node  prediction 
problem:  a  node  and  all  of  its  edges  are  either  deliberately  erased  (someone,  along  with  all  their 
relationship  information,  is  removed)  or  added  (a  fictitious  actor  is  added  to  the  network). 

In a preliminary publication we introduced a method for successfully addressing the node
prediction problem, albeit only for simulated social networks. (We are currently seeking relevant
real-world social network data for this project.) We have termed our particular approach to the
node prediction problem, which employs symbolic regression as one component, the Shadow Networks
approach.

A summary of the approach is as follows. Imagine one has access to several social networks. In
addition, one can observe not only the structural properties of those networks (who is connected to
whom, and how) but also their dynamic properties (how information flows from one
person to another, and at what rate). Armed with these data, it is possible to train a model
to predict how long it generally takes for information to
flow from one person to another, given structural properties of the network. In essence the model
has the form


$$T_{ij} = f(S_i, S_j, S_{ij}) \tag{2}$$

where $T_{ij}$ represents the time (on average) it takes for information to flow from individual i to
individual j, $S_i$ is a set containing structural properties of i (e.g., how many friends they have in the
network, or how close to a hub they are), $S_j$ is a set containing structural properties of j, and $S_{ij}$ is
a set containing structural properties of the relationship between i and j (e.g., the length of the
shortest path linking i and j).

Once the model learns to make predictions of flow from structure, one can apply the model to
a new social network. The model then acts in a diagnostic fashion: it makes predictions for flow
between each pair of individuals in the network, and if its predictions systematically fail, it is likely
that the network has been tampered with in some way.

To  develop  an  intuition  for  this  idea,  consider  exposing  a  model  to  a  series  of  pipes,  each  of 
which  is  composed  of  k  concrete  segments.  The  model  then  observes  water  poured  into  one  end 
of  the  pipe,  and  measures  how  long  it  takes  for  the  water  to  emerge  from  the  other  end  of  the 
pipe.  This  model  may  learn,  in  this  simple  case,  that  the  time  for  water  to  flow  through  the  pipe  is 
proportional  to  the  number  of  concrete  segments  making  up  the  pipe.  If  the  model  is  then  exposed 
to  another  pipe  made  up  of  three  segments,  but  it  takes  water  four  units  of  time  to  traverse  the  pipe, 
the  model  may  predict  that  there  is  a  fourth  segment  in  the  pipe  that  it  was  forbidden  to  see. 
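The pipe analogy can be made concrete with a minimal sketch. The training numbers, the function names `fit_rate` and `implied_segments`, and the choice of a least-squares fit through the origin are illustrative assumptions; they are not part of the Shadow Networks method itself.

```python
def fit_rate(segments, times):
    """Least-squares slope through the origin: time ~ rate * segments."""
    num = sum(s * t for s, t in zip(segments, times))
    den = sum(s * s for s in segments)
    return num / den


def implied_segments(rate, observed_time):
    """Number of segments the fitted model needs to explain a flow time."""
    return observed_time / rate


# Hypothetical training pipes: one unit of time per visible segment.
train_segments = [1, 2, 3, 5, 8]
train_times = [1.0, 2.0, 3.0, 5.0, 8.0]
rate = fit_rate(train_segments, train_times)

# Test pipe: three visible segments, but water takes four units of time.
visible = 3
observed = 4.0
hidden = round(implied_segments(rate, observed)) - visible
print(f"rate={rate:.2f}, inferred hidden segments={hidden}")
```

The mismatch between the predicted and observed flow time is the only evidence used; the model never sees the extra segment directly.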

In the manuscript that follows, we demonstrate that our method can be used to detect whether
nodes have been removed from (omission) or added to (commission) the network. Furthermore, in the
case of node removal, the model’s error tends to spike for individuals who are close to the hidden
node. This provides a signal not only that someone may have been scrubbed from the network, but also
of who may know about the scrubbing: the hidden actor’s colleagues, as evidenced by the network itself.
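One way to picture this localization step is to average the model's prediction error over all pairs that involve a given individual and then rank individuals by that average. The sketch below assumes a signed error (predicted minus observed flow time) and uses made-up numbers; the function name `node_error_profile` and the data are hypothetical, and the manuscript's actual error measure may differ.

```python
from collections import defaultdict


def node_error_profile(pairs):
    """Average signed prediction error (predicted - observed) per node.

    `pairs` maps a node pair (i, j) to a tuple (predicted_T, observed_T).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (i, j), (pred, obs) in pairs.items():
        for node in (i, j):
            totals[node] += pred - obs
            counts[node] += 1
    return {node: totals[node] / counts[node] for node in totals}


# Hypothetical predictions: the model badly overestimates the a-b flow time,
# as it would if a hidden node were bridging a and b.
pairs = {
    ("a", "b"): (9.0, 4.0),
    ("a", "c"): (5.0, 5.0),
    ("b", "c"): (6.0, 5.5),
    ("c", "d"): (4.0, 4.0),
    ("b", "d"): (7.0, 6.5),
    ("a", "d"): (6.0, 6.0),
}
profile = node_error_profile(pairs)

# The two individuals with the largest average error are the prime suspects
# for being neighbors of a hidden node.
suspects = sorted(profile, key=profile.get, reverse=True)[:2]
print(suspects)
```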

There  are  several  limitations  that  currently  exist  with  the  method.  To  date  it  has  only  been 
validated  on  synthetic  data.  We  are  currently  seeking  data  from  real  social  networks  usable  for 
this  method.  Further,  it  assumes  that  the  data  on  which  the  models  are  trained  come  from  a  fully 
observable  network,  and  that  these  training  networks  have  not  yet  been  tampered  with.  Future 
work  will  address  these  limitations. 

2.1  Relevance  for  U.S.  defense  and  security. 

The  rapid  rise  of  big  data  is  posing  novel  challenges  for  security  and  defense,  especially  data 
arising from social networks. It would be of great use to be able to automatically identify whether
data from social networks has been tampered with, and specifically whether information generated
by one or a few individuals has been deliberately erased.

The  current  method  also  does  not  make  assumptions  about  what  kind  of  information  is  flowing 
across  the  network:  it  could  be  tweets  flowing  across  a  social  network,  packets  flowing  across  a 
computer  network,  or  text  messages  flowing  across  a  cellphone  network. 

Given  this,  it  is  possible  that  this  method  could  be  adapted  for  discovering  trojan  horses  in 
software  and/or  hardware  systems,  or  cyberinfrastructure  in  general. 

It is possible that the method could be adapted for other domains in which it is imperative to
find hidden individuals in a social network. One likely future domain of application is disease
modeling. If individuals suffering from the outbreak of a disease will not or cannot report to a
local clinic, data about those infected individuals is lost. However, if their friends and relatives do
report to the clinic, it may be possible to identify a group of individuals who share social ties with
an individual who is not present in the collected data. These peripheral individuals could
then be contacted to verify the existence of the missing individuals and to learn how to contact them.
2.2  Bagrow  et  al.  “Shadow  Networks...”  (2015).

A  technical  manuscript  describing  the  shadow  networks  method  in  detail  follows. 
Abstract

Complex,  dynamic  networks  underlie  many  systems,  and  understanding  these  networks  is  the  concern  of  a  great 
span  of  important  scientific  and  engineering  problems.  Quantitative  description  is  crucial  for  this  understanding 
yet,  due  to  a  range  of  measurement  problems,  many  real  network  datasets  are  incomplete.  Here  we  explore  how 
accidentally  missing  or  deliberately  hidden  nodes  may  be  detected  in  networks  by  the  effect  of  their  absence  on 
predictions  of  the  speed  with  which  information  flows  through  the  network.  We  use  Symbolic  Regression  (SR)  to 
learn  models  relating  information  flow  to  network  topology.  These  models  show  localized,  systematic,  and  non- 
random  discrepancies  when  applied  to  test  networks  with  intentionally  masked  nodes,  demonstrating  the  ability  to 
detect the presence of missing nodes and where in the network those nodes are likely to reside.

james.bagrow@uvm.edu
1  Introduction


The  field  of  complex  networks  has  emerged  and  matured  over  the  last  15  years,  heralded  by  small-world  [1]  and 
scale-free  networks  [2],  and  principally  enabled  by  the  advent  of  readily  available  large-scale  datasets.  Much  work 
has  been  focused  on  simple  descriptions  of  complex  networks,  leading  to  an  evolving  collection  of  structures,  network 
statistics [3, 4], and generative mechanisms [5, 6, 2].

All  along,  the  problem  of  missing  data  has  been  both  obvious  and  ubiquitous — few  network  datasets  are  complete 
or  nearly  so — and  yet  this  issue  has  largely  been  ignored.  The  body  of  work  that  does  exist  on  missing  data  has  mostly 
focused  on  the  problem  of  unrecorded  edges  or  interactions  [7,  8,  9,  10],  while  only  some  have  explored  the  harder 
problems  of  node  and  context  omission  [11,  12,  13]  using  various  approaches  such  as  inference  based  on  maximum 
likelihood estimation [14, 15].

While  missing  data  is  certainly  understood  to  affect — sometimes  dramatically — different  kinds  of  static  network 
statistics  in  different  ways  [11],  the  effects  of  measurement  error  on  dynamic,  real  social  networks  [16,  17,  18,  19] 
remain  largely  unknown.  This  problem  is  especially  challenging  when  the  amount  of  data  omission  is  not  known 
and  can  only  be  estimated  from  the  observed  data  set.  The  implications  for  how  to  contend  with  a  given  network, 
suspected to be corrupted in some fashion, are substantial. In the case of public health policy, for example, positive
evidence for the role of social contagion in the spreading of such disparate attributes as happiness [20], obesity [21],
and loneliness [22] has been challenged due to its reliance on under-sampled reconstructed social networks [23].

A systematic framework to accommodate missing data for static and dynamic networks remains elusive, and
poses a great challenge to the network science community. Much success in the study of complex, dynamic networks
has come from approaches born out of statistical mechanics and dynamical systems, with the great example arguably
being Simon’s rich-get-richer model underlying scale-free networks [5, 6, 2]. Yet it is clear that many adaptive complex
systems are strongly algorithmic in nature, and are not well or completely described by integrodifferential equations.

Briefly, our approach to studying missing or hidden node detection is as follows. First, we construct a set of
network topologies (Sec. 2.1). We then use an idealized transaction model to simulate the flow of information “packets”
across these networks. These packets could represent IP packets flowing across a computer network, citations within
a scientific collaboration network, or messages passed among members of a social network such as Twitter (Sec. 2.2).
Next, the resulting transaction data is collected and fed to a stochastic optimization method. The goal of this step is
to generate a mathematical model that predicts the speed of information flow between pairs of nodes in the network,
given structural information about those nodes and the network they were drawn from (Sec. 2.3). Finally, the evolved
transaction model is presented with rates of information flow between nodes from a different network. If there are
systematic errors or biases in the model’s prediction of information flow, this indicates that nodes may have been added
to or removed from the network.

Approved for public release; distribution is unlimited.

Figure 1: Illustration motivating the method. Nodes in this network pass information around (packets), and we monitor the arrival
times of these packets. The two blue nodes appear much farther apart topologically when the red node is hidden. Given the observed
information flows, the highlighted packet would appear to be arriving unusually quickly given the apparent long distance path it
likely took (red links). This unexpectedly rapid flow may be a clue that unseen network elements are present.

The  intuition  underlying  our  approach  to  node  prediction  may  be  clarified  by  considering  the  cartoon  example 
in  Fig.  1.  Two  nodes  are  connected  by  a  third  node,  making  them  two  steps  apart  on  the  network  topology.  Due  to 
their  close  proximity,  information  should  flow  between  them  relatively  quickly,  on  average.  However,  if  the  bridge 
node  is  hidden  from  us,  we  may  erroneously  conclude  these  two  nodes  are  actually  quite  far  apart  (illustrated  in  the 
figure by the red path). We would then expect information flow to be slow between them, even for information
originating  from  other  parties,  and  we  would  be  surprised  by  the  speed  of  flow  we  actually  observe.  If  we  consistently 
overestimate  the  time  it  takes  for  information  to  appear  at  one  node  after  it  appears  at  the  other,  then  this  provides 
evidence  that  a  hidden  presence  in  the  network  is  facilitating  the  flow  of  information. 
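The effect of a hidden bridge node on apparent distance can be illustrated with a small breadth-first search. The toy graph, the node names, and the function `shortest_path_length` below are hypothetical and only mirror the cartoon in Fig. 1.

```python
from collections import deque


def shortest_path_length(adj, source, target, hidden=frozenset()):
    """Breadth-first search distance, ignoring any nodes in `hidden`."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in adj[node]:
            if nbr in hidden or nbr in seen:
                continue
            if nbr == target:
                return dist + 1
            seen.add(nbr)
            queue.append((nbr, dist + 1))
    return None  # target unreachable


# Toy graph: u and v are joined by a bridge node b, plus a long detour.
edges = [("u", "b"), ("b", "v"), ("u", "p"), ("p", "q"), ("q", "r"), ("r", "v")]
adj = {}
for x, y in edges:
    adj.setdefault(x, set()).add(y)
    adj.setdefault(y, set()).add(x)

true_dist = shortest_path_length(adj, "u", "v")
apparent_dist = shortest_path_length(adj, "u", "v", hidden={"b"})
print(true_dist, apparent_dist)
```

With the bridge hidden, the two nodes appear four steps apart instead of two, so a model trained on complete networks would overestimate the flow time between them.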


2  Methods 


Here  we  describe  the  network  topologies  we  will  employ  in  this  study,  the  details  of  how  we  simulate  information  flow 
on  these  topologies,  the  predictive  models  we  generate  for  the  flow  times,  and  the  test  procedure  and  measurements 
we  use  to  explore  how  well  hidden  nodes  can  be  detected. 
2.1  Network  model


To gauge the potential of our approach, we first developed it against simulated transactional networks. We used
scale-free networks generated according to the common preferential attachment model [5, 6, 2]. Each scale-free network
was grown to a size of N = 250 nodes by adding new nodes one at a time, and each new node attached to two existing
nodes preferentially according to their degree [2]. These undirected networks have a power-law degree distribution
$\Pr(k) \sim k^{-3}$. The earlier a node is added to the network, the higher the degree it will tend to have, so hubs
(highly connected nodes) tend to be among the early nodes of the network.
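A minimal sketch of such a growth process, using only the Python standard library, is given below. The seed graph (a triangle), the function name `preferential_attachment`, and the random seed are assumptions made for illustration; the text does not specify how the networks were initialized.

```python
import random


def preferential_attachment(n, m=2, seed=0):
    """Grow an undirected network in which each new node links to m distinct
    existing nodes chosen with probability proportional to their degree."""
    rng = random.Random(seed)
    # Seed graph (an assumption): a complete graph on m + 1 nodes.
    adj = {i: set(range(m + 1)) - {i} for i in range(m + 1)}
    # Each node appears in `stubs` once per incident edge, so a uniform draw
    # from `stubs` is a degree-proportional draw from the nodes.
    stubs = [i for i in adj for _ in adj[i]]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        adj[new] = set()
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            stubs.extend((new, t))
    return adj


net = preferential_attachment(250, m=2)
degrees = {node: len(nbrs) for node, nbrs in net.items()}

# Early nodes should, on average, have accumulated more links than late ones.
early = sum(degrees[i] for i in range(25)) / 25
late = sum(degrees[i] for i in range(225, 250)) / 25
print(f"mean degree of first 25 nodes: {early:.1f}, of last 25 nodes: {late:.1f}")
```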

2.2  Transaction  dynamics 

For each network that was constructed, we simulated transactions, the creation and movement of packets of content,
occurring between pairs of connected nodes. Each packet carries a unique identifier so that it can be tracked when it
appears at different nodes in the network, and each node maintains a growing, time-ordered list of the content packets
it has received. We simulated transactions as follows. At each time step, each node is activated with probability p = 1.
This may represent a member of an online social network logging into their account, or a node in a computer network
being turned on. Each activated node creates a new piece of content with probability $p_{\text{create}} = 1/9$
or imports a piece of content from a neighbor with probability $p_{\text{import}} = 2 p_{\text{create}}$. In the former case, this may
correspond to a member of the social network Twitter creating a new tweet; in the latter case it may correspond to
them “retweeting” a tweet from someone they follow. The above probabilities were chosen to plausibly model the
relative frequencies of creating versus importing content; an experimenter may equally estimate their values from a
real dataset.

If a node i chooses to create a new packet, a new ID is generated and that packet is added to i’s list of content.
Neighbors  of  i  may  later  choose  to  import  this  new  packet  into  their  own  content  lists,  letting  it  spread  throughout  the 
network.  Importing  works  as  follows.  If  node  i  chooses  to  import  content,  one  of  i’s  neighboring  nodes  j  is  selected 
at  random  (assuming  it  has  neighbors).  Once  j  is  selected,  the  information  packets  in  j’s  list  are  scanned  from  most 
recently  generated  (or  imported)  to  earliest  generated  (or  imported).  The  scan  stops  when  an  information  packet  is 
found  that  is  not  contained  in  i’s  list.  If  no  such  packet  can  be  found,  no  action  for  node  i  is  taken  and  the  next  activated 
node  is  considered.  If  such  a  packet  is  found,  it  is  copied  from  node  j  to  node  i. 

This process is repeated for the next node that has been activated during the current time step. The simulation
of transactions halts when 3000 time steps elapse. With the chosen values of $p_{\text{create}}$ and $p_{\text{import}}$, each node will on
average participate in the transaction model 1000 times. To avoid any pathological effects the nodes are activated in
randomized order for each time step.
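The transaction rules described above can be sketched as follows. This is a simplified reading of the text, not the authors' code: the function name `simulate`, the six-node ring used as a test network, the shortened run of 300 steps, and the assumption that every node is activated at every step are all illustrative choices.

```python
import random

P_CREATE = 1 / 9
P_IMPORT = 2 * P_CREATE


def simulate(adj, steps=3000, seed=0):
    """Return {node: {packet_id: arrival_step}} after `steps` time steps."""
    rng = random.Random(seed)
    timeline = {node: [] for node in adj}  # time-ordered packet ids per node
    arrival = {node: {} for node in adj}   # packet id -> step it appeared
    next_id = 0
    nodes = list(adj)
    for t in range(steps):
        rng.shuffle(nodes)                 # randomized activation order
        for i in nodes:
            u = rng.random()
            if u < P_CREATE:
                # Create a new packet with a fresh identifier.
                timeline[i].append(next_id)
                arrival[i][next_id] = t
                next_id += 1
            elif u < P_CREATE + P_IMPORT and adj[i]:
                # Import: pick a random neighbor and copy its most recent
                # packet that node i does not already hold, if any.
                j = rng.choice(sorted(adj[i]))
                for packet in reversed(timeline[j]):
                    if packet not in arrival[i]:
                        timeline[i].append(packet)
                        arrival[i][packet] = t
                        break
    return arrival


# Hypothetical test network: a ring of six nodes, run for 300 steps.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
arrival = simulate(ring, steps=300)

# Expected number of create-or-import actions per node over a full run.
expected_actions = 3000 * (P_CREATE + P_IMPORT)
print(round(expected_actions))
```

The last two lines reproduce the arithmetic behind the figure of 1000 participations per node: 3000 steps times a combined action probability of 1/9 + 2/9 = 1/3.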

Once  the  transactions  have  been  simulated  and  we  have  a  timeline  of  packets  for  each  node,  we  then  compute 
the  average  time  it  takes  for  packets  to  flow  between  nodes.  For  every  pair  of  nodes  in  a  graph,  we  computed  the 
intersection  between  their  respective  sets  of  information  packets.  We  thus  obtained  each  packet  that  both  nodes  in  a 
pair either imported or created. We then computed the average time $T_{ij}$ required for packets to travel between nodes i
and j,

$$T_{ij} = \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} \left| t_i^{(k)} - t_j^{(k)} \right| \tag{1}$$

where $n_{ij}$ packets are shared by nodes i and j, and $t_i^{(k)}$ indicates the time step at which packet k was created at (or arrived
at) node i. In order to remove noise resulting from small sample sizes, all $T_{ij}$ for which $n_{ij} < 100$ were discarded. Note
that we are not measuring a causal or directional relationship between the node pair; a shared packet could easily have
been created by a third node and then eventually reached both i and j through the importing process. The delay time
$T_{ij}$ is a dynamical measure of closeness between the nodes.
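A short sketch of this computation is shown below. It assumes that the delay for a shared packet is the absolute difference of its arrival times at the two nodes, consistent with the non-directional reading in the text; the function name `mean_delay` and the example arrival times are hypothetical.

```python
def mean_delay(arrival_i, arrival_j, min_shared=100):
    """Average absolute arrival-time difference over packets held by both nodes.

    `arrival_i` and `arrival_j` map packet ids to the time step at which the
    packet was created at, or arrived at, each node. Returns None when fewer
    than `min_shared` packets are shared, mirroring the discard rule.
    """
    shared = arrival_i.keys() & arrival_j.keys()
    if len(shared) < min_shared:
        return None
    return sum(abs(arrival_i[k] - arrival_j[k]) for k in shared) / len(shared)


# Hypothetical arrival records for two nodes; packets 2 and 3 are shared.
arrival_i = {1: 10, 2: 15, 3: 40}
arrival_j = {2: 18, 3: 33, 4: 50}

t_small_threshold = mean_delay(arrival_i, arrival_j, min_shared=2)
t_default = mean_delay(arrival_i, arrival_j)
print(t_small_threshold, t_default)
```

With the threshold lowered to two shared packets, the delay is (3 + 7) / 2 = 5 time steps; with the default threshold of 100, the pair is discarded.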


2.3  Symbolic  Regression 

Given a network topology and the information flow times $T_{ij}$ (Eq. (1)), we then constructed a matrix $D$ to serve as the
dataset for training models to predict $T_{ij}$ as a function of the structural properties of nodes $i$ and $j$. Each pair of nodes is
allocated its own row. One column in $D$ contains the $T_{ij}$ values, while the remaining columns correspond to structural
network properties of node $i$, node $j$, or some metric relating them. An experimenter is free to choose which metrics
to use. The individual node properties we used were node degrees $k_i$, $k_j$; clustering coefficients $c_i$, $c_j$; eccentricities $e_i$,
$e_j$; node betweennesses $B_i$, $B_j$; eigenvector centralities $x_i$, $x_j$, where $x_i$ is the $i$-th element of the leading eigenvector of
the network's adjacency matrix; and closeness centralities $C_i$, $C_j$. For node-pair properties we used the length $L_{ij}$ of
the shortest topological path between $i$ and $j$. Finally, we included global network properties: $N$, the number of nodes;
$M$, the number of edges; $r$, the degree-mixing assortativity coefficient [24]; and the graph's diameter $\Delta$ and radius $\rho$.
These global quantities were the same for all rows of $D$, but providing them gives the optimization method a set of
plausible constants to choose from.¹
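The construction of the data matrix can be sketched as follows. This minimal Python example builds one row per node pair using only a subset of the metrics listed above (degree, clustering coefficient, shortest path length, and the global counts of nodes and edges); the adjacency-dictionary layout, the column names, and the helper names are assumptions made for illustration, not the original pipeline.

```python
from collections import deque
from itertools import combinations


def shortest_path_lengths(adj, source):
    """Breadth-first search: hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def clustering(adj, u):
    """Fraction of pairs of u's neighbors that are themselves linked."""
    nbrs = sorted(adj[u])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def build_rows(adj, flow_times):
    """One row of D per node pair: the target T_ij plus structural columns."""
    n_nodes = len(adj)
    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    rows = []
    for (i, j), t_ij in sorted(flow_times.items()):
        dist = shortest_path_lengths(adj, i)
        rows.append({
            "T_ij": t_ij,
            "k_i": len(adj[i]), "k_j": len(adj[j]),
            "c_i": clustering(adj, i), "c_j": clustering(adj, j),
            "L_ij": dist.get(j, float("inf")),
            "N": n_nodes, "M": n_edges,  # identical for all rows
        })
    return rows


# Toy undirected network: a triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
rows = build_rows(adj, {("a", "d"): 3.5})
```

The remaining columns (eccentricity, betweenness, eigenvector and closeness centrality, assortativity, diameter, radius) would be appended to each row in the same way.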

We then perform symbolic regression (SR) on this dataset to find functions $f$ that predict $T_{ij}$ as a function of the
node pair's structural properties:

$$T_{ij} = f(k_i, c_i, e_i, B_i, x_i, C_i, k_j, c_j, e_j, B_j, x_j, C_j, L_{ij}, N, M, r, \Delta, \rho). \tag{2}$$

¹ These can also become variables if one chooses to apply SR to a dataset containing multiple networks of different sizes.

Approved for public release; distribution is unlimited.



Symbolic regression performs model selection and parameter estimation simultaneously to determine the functional
form of Eq. (2). A commonly employed method for instantiating symbolic regression is genetic programming [25], a
stochastic optimization method that simultaneously optimizes a population of equations to increasingly fit the supplied
data matrix $D$. As the name implies, this method is loosely based on Darwinian evolution. An initial population of
random equations is assessed against $D$: models with high error are discarded, while models with lower error are
retained. The now-vacant slots in the population are filled by repeatedly copying and mutating a single equation,
or producing two new equations by performing sexual recombination with a pair of surviving equations. Mutations
involve adding, removing, or altering a term in the equation.
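The evolutionary loop described above can be sketched in a few dozen lines of Python. This is a deliberately simplified, single-objective illustration and not the implementation of [25] or the one used in this study: equations are nested tuples, the operator set is limited to addition, subtraction, and multiplication, and all names and parameter values (`evolve`, `pop_size`, `max_size`, the size cap itself) are assumptions made for the example.

```python
import random

# Binary operators available as building blocks (an illustrative choice).
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}


def random_tree(variables, depth, rng):
    """Grow a random equation: a variable name or a tuple (op, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(variables)
    op = rng.choice(sorted(OPS))
    return (op, random_tree(variables, depth - 1, rng),
            random_tree(variables, depth - 1, rng))


def evaluate(tree, row):
    if isinstance(tree, str):
        return row[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, row), evaluate(right, row))


def size(tree):
    """Total number of operators and operands in the equation."""
    return 1 if isinstance(tree, str) else 1 + size(tree[1]) + size(tree[2])


def error(tree, rows, target):
    """Mean squared error of the equation against the data rows."""
    diffs = [evaluate(tree, r) - r[target] for r in rows]
    return sum(d * d for d in diffs) / len(diffs)


def paths(tree, path=()):
    """Yield the position of every subtree as a tuple of child indices."""
    yield path
    if not isinstance(tree, str):
        yield from paths(tree[1], path + (1,))
        yield from paths(tree[2], path + (2,))


def get_at(tree, path):
    for idx in path:
        tree = tree[idx]
    return tree


def replace_at(tree, path, new):
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = replace_at(tree[path[0]], path[1:], new)
    return tuple(parts)


def mutate(tree, variables, rng):
    """Replace one randomly chosen subtree with a fresh random one."""
    spot = rng.choice(list(paths(tree)))
    return replace_at(tree, spot, random_tree(variables, 2, rng))


def crossover(a, b, rng):
    """Swap one random subtree between two parents, giving two children."""
    pa, pb = rng.choice(list(paths(a))), rng.choice(list(paths(b)))
    return [replace_at(a, pa, get_at(b, pb)), replace_at(b, pb, get_at(a, pa))]


def evolve(rows, target, variables, pop_size=60, generations=30,
           max_size=31, seed=0):
    rng = random.Random(seed)
    population = [random_tree(variables, 3, rng) for _ in range(pop_size)]
    history = []  # best error seen at the start of each generation
    for _ in range(generations):
        population.sort(key=lambda t: error(t, rows, target))
        history.append(error(population[0], rows, target))
        survivors = population[: pop_size // 2]  # discard high-error models
        children = []
        while len(children) < pop_size - len(survivors):
            if rng.random() < 0.5:
                new = [mutate(rng.choice(survivors), variables, rng)]
            else:
                new = crossover(rng.choice(survivors), rng.choice(survivors), rng)
            # Assumed size cap to keep equations from growing without bound.
            children.extend(c for c in new if size(c) <= max_size)
        population = survivors + children[: pop_size - len(survivors)]
    return min(population, key=lambda t: error(t, rows, target)), history


# Toy data generated from T = L * (1 + k), used only to exercise the loop.
rows = [{"L": float(L), "k": float(k), "T": float(L * (1 + k))}
        for L in range(1, 5) for k in range(1, 6)]
best, history = evolve(rows, "T", ["L", "k"])
```

Because the lower-error half of the population always survives, the best error in `history` can never increase from one generation to the next.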

The  SR  implementation  we  used  in  this  study  incorporates  multiobjective  optimization  to  perform  search  [26,  27]. 
The  errors  and  sizes  of  the  models  in  the  population  are  computed.  Size  is  defined  as  the  total  number  of  operators 
and  operands  in  the  equation.  The  Pareto  front  of  models  with  least  error  and  smallest  size  is  determined,  and  models 
off  this  front  are  discarded.  New  models  are  generated  by  randomly  choosing  surviving  models  on  the  front.  When 
run against a dataset generated by a single scale-free network composed of $N = 250$ nodes, the best equation found²,
in  terms  of  balancing  complexity  and  accuracy,  was 


TU  -  Lij 


1  +  In 


N-kf 


Lj j  +  kj  +  kj  ■ 


■  Cl  + 


kj' 


Ln.  +  hk 


IJ 


l^J 


(3) 


This equation achieved a high correlation coefficient of $R = 0.88$ when compared with the simulated $T_{ij}$. We remark
that Eq. (3) seems plausible in nature: the dominant variable is the distance $L_{ij}$ between $i$ and $j$, which is intuitive for
the transaction model. The degrees of $i$ and $j$, the clustering of $j$, and the global network properties $N$ and the network
radius then comprise a small, logarithmic correction to $L_{ij}$. Other variables did not factor into this function.
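The Pareto-front selection step used by the multiobjective search in this section can be illustrated with a short Python sketch. A model is kept only if no other model is at least as good on both objectives (error and size) and strictly better on one. The tuple layout, the function names, and the error values below are hypothetical and chosen purely for illustration.

```python
def model_size(tree):
    """Count operators and operands in an equation stored as nested tuples."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(model_size(child) for child in tree[1:])


def pareto_front(models):
    """Return the non-dominated entries of a list of (name, error, size)."""
    front = []
    for name, err, size in models:
        dominated = any(
            o_err <= err and o_size <= size and (o_err < err or o_size < size)
            for _, o_err, o_size in models
        )
        if not dominated:
            front.append((name, err, size))
    return front


# Hypothetical population: (equation, error, size). The errors are made up.
models = [
    ("L", 4.0, 1),
    ("L + k", 2.5, 3),
    ("L * k", 3.0, 3),
    ("L + k * c", 2.5, 5),
    ("L * (1 + k)", 1.0, 5),
]
front = pareto_front(models)
front_names = [name for name, _, _ in front]
```

Here `L * k` is dropped because `L + k` has the same size and lower error, and `L + k * c` is dropped because `L + k` matches its error with fewer symbols.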


2.4  Tampered  networks 

To test the ability of the SR model to indicate the presence of a hidden node, we need access to a ground-truth test bed.
To create such a test using our model networks (Sec. 2.1), we generate a new scale-free network, simulate transactions
on it (Sec. 2.2), then choose one or more nodes to hide; they are removed before computing the network structural
metrics and the information flow times (Eq. (1)). In this way hidden nodes fully participate in the flow of packets,
but otherwise they are unknown to the symbolically regressed model. Comparing the SR model's delay predictions
(Eq. (3)) to the simulated delay times, we can search a posteriori for systematic errors among the neighbors of the
hidden node or nodes.

² Note that SR was prevented from using numerical prefactors to enforce greater structural diversity in models along the Pareto front.

To measure the effects of the hidden node we study three quantities. The first is the coefficient of determination
$R^2$ between the $T_{ij}$ values measured for the non-hidden node pairs from the transaction simulations and the $T_{ij}$ values
predicted by the SR model, where $R$ is the Pearson correlation coefficient. If the value of $R^2$ drops significantly compared to
$R^2$ for the untampered network, then that supports our ability to detect missing or hidden nodes.

Beyond this global measure we also use two local measurements to assess the effect a hidden node has on a single
non-hidden node $i$:

$$\mathrm{RMSE}_i = \sqrt{E_j\left[\left(T_{ij}^{\mathrm{pred}} - T_{ij}^{\mathrm{obs}}\right)^2\right]}, \tag{4}$$

$$\mathrm{Bias}_i = E_j\left[T_{ij}^{\mathrm{pred}} - T_{ij}^{\mathrm{obs}}\right], \tag{5}$$

where the expectation $E_j[\cdot]$ runs over all (non-hidden) nodes $j \neq i$ that are connected to $i$ ($L_{ij} < \infty$), and $T_{ij}^{\mathrm{pred}}$ and $T_{ij}^{\mathrm{obs}}$
denote the flow time predicted by the SR model and the actual flow time observed from the simulations, respectively.
$\mathrm{RMSE}_i$ captures the magnitude of the SR model's error for node $i$, while $\mathrm{Bias}_i$ measures whether it consistently
over- or under-estimated $T_{ij}$. A positive bias indicates that information is traveling faster than expected by the SR
model.
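The two local measures can be computed directly from predicted and observed flow times. The sketch below is a minimal Python illustration; the dictionary layout keyed by node pair and the function name `node_error_stats` are assumptions, and each pair is credited to both of its endpoints because the flow time is symmetric.

```python
import math


def node_error_stats(pred, obs):
    """Return {node: (rmse, bias)} from {(i, j): flow time} dictionaries.

    Only pairs present in both dictionaries are used, i.e. connected,
    non-hidden pairs with an observed flow time.
    """
    residuals = {}
    for (i, j), t_pred in pred.items():
        if (i, j) not in obs:
            continue
        r = t_pred - obs[(i, j)]  # positive: information arrived early
        residuals.setdefault(i, []).append(r)
        residuals.setdefault(j, []).append(r)
    stats = {}
    for node, rs in residuals.items():
        rmse = math.sqrt(sum(r * r for r in rs) / len(rs))
        bias = sum(rs) / len(rs)
        stats[node] = (rmse, bias)
    return stats


# Toy values, chosen only to make the arithmetic easy to follow.
pred = {("a", "b"): 5.0, ("a", "c"): 7.0, ("b", "c"): 3.0}
obs = {("a", "b"): 2.0, ("a", "c"): 3.0, ("b", "c"): 3.0}
stats = node_error_stats(pred, obs)
```

In this toy case node `a` has residuals 3 and 4, giving a bias of 3.5 and an RMSE of $\sqrt{12.5}$.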


3  Results 

Our first experiment consisted of measuring the change in the coefficient of determination $R^2$ for tampered scale-free
networks (Sec. 2.4). To do this we first generated an ensemble of 100 untampered scale-free networks (Sec. 2.1) and
simulated transactions on each (Sec. 2.2). We applied the SR model of $T_{ij}$ to these networks (Eq. (3)) and computed
$R^2$ for each. As shown in Fig. 2A, the distribution of $R^2$ was sharply peaked around $R^2 \approx 0.77$, the value that the SR
model achieved on its training data (Sec. 2.3). The narrowness of this distribution indicates that the SR model has
useful predictive power.

Next we generated another ensemble of scale-free networks and simulated transactions, but now we tampered with
each network by hiding one random hub³. We see a significant drop in accuracy (lower $R^2$) for the SR model on these
tampered networks (Mann-Whitney U test $p < 10^{-10}$, Cohen's $d = -0.898$), indicating that we are likely to see the
effect of a hidden node by a drop in the accuracy of the model. Hiding multiple hubs leads to even greater losses in
accuracy (Fig. 2A and inset).

³ We take a hub to be a randomly chosen node that was introduced in the first 20% of the network growth process, taking advantage of preferential
attachment's early-mover advantage.

Figure 2: Detecting hidden nodes in scale-free networks. (A) The distribution of the correlation coefficient comparing the simulated
transaction times and those predicted by the symbolically regressed model. We see for networks drawn from the same ensemble
as the training network (untampered) that $R^2$ is sharply peaked around 0.78. The distribution of $R^2$ changes significantly after a
single node removal (Mann-Whitney U test $p < 10^{-10}$, Cohen's $d = -0.898$). (Inset) The median $R^2$ decreases as more high-degree
nodes are removed. (B) The likelihood of detecting a single hidden node by the change in $R^2$. Using the distribution of $R^2$ for
the untampered network as the null model, we standardize the $R^2$ distribution for networks with one hidden node. Looking at this
distribution we see that nearly 60% of the time we can successfully detect that a relatively high-degree node is absent with 95%
confidence. If we consider lower-degree hidden nodes, which are more challenging to discover as they tend to participate less in
information flow, this drops to approximately 25%. Distributions shown in panel A were computed using kernel density estimation;
each curve was truncated at the largest value observed in the ensemble data to indicate the empirical ranges of $R^2$.

However, the comparisons shown in Fig. 2A are for an ensemble of networks, while practically we seek to detect
the presence of a missing node in a single network. To determine if this is feasible we standardized the distribution of
$R^2$ for the ensemble of networks with a single hidden hub relative to the untampered ensemble, giving a z-score $z(R^2)$
for each tampered network. Large negative values of $z$ indicate a statistically significant drop in $R^2$. The cumulative
probability distribution shown in Fig. 2B tells us that nearly 60% of the tampered ensemble has $z(R^2) < -1.6449$,
meaning that nearly 60% of the time we can determine with 95% confidence that a single network is missing a hub.
At the 50% confidence limit, $z < 0$, which corresponds to how well we can beat a coin flip, the detection rate is nearly 90%.
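The standardization and thresholding steps can be sketched as follows in Python. The $R^2$ values are invented toy numbers, not results from this study, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def z_scores(tampered_r2, untampered_r2):
    """Standardize tampered R^2 values against the untampered (null) ensemble."""
    mu = mean(untampered_r2)
    sigma = stdev(untampered_r2)  # sample standard deviation (assumed)
    return [(r2 - mu) / sigma for r2 in tampered_r2]


def detection_rate(zs, threshold=-1.6449):
    """Fraction of tampered networks whose z-score falls below the threshold."""
    return sum(z < threshold for z in zs) / len(zs)


# Toy ensembles, for illustration only.
untampered = [0.75, 0.77, 0.79, 0.76, 0.78]
tampered = [0.70, 0.76, 0.72, 0.78]
zs = z_scores(tampered, untampered)
rate_95 = detection_rate(zs)          # one-sided 95% confidence threshold
rate_50 = detection_rate(zs, 0.0)     # coin-flip threshold
```

The default threshold of -1.6449 is the one-sided 5% quantile of the standard normal distribution, matching the value quoted in the text.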

These results indicate that the presence of a single hidden node can often be detected. An important question,
however, is whether or not we can identify the location of this hidden node. To study this, we computed the errors
and biases (Eqs. (4) and (5)) of each node in a tampered network. If the neighbors of the hidden node show significant
error or bias, then we can determine the location of the hidden node. We show a network diagram of one
tampered scale-free network in Fig. 3A. Node color and size are proportional to RMSE and the hidden node is indicated
with a diamond (◇). We observe that many neighbors of the hidden node have far greater RMSE than other nodes in
the network. This is exactly the evidence needed to estimate the hidden node's location within the network topology.

To determine if these results are significant, we computed, for the ensemble of scale-free networks with one hidden
hub, distributions of RMSE and Bias separately for neighbors of the missing hub, next-nearest neighbors, and other
nodes. The RMSE was significantly larger (Mann-Whitney U test $p \ll 10^{-10}$, Cohen's $d = 4.69$) for neighbors of
the hidden node across the entire ensemble (the median error for neighbors was ≈ 4.33 timesteps compared with 1.1
timesteps for other nodes). Next-nearest neighbors, those nodes two steps away from the hidden node in the original
topology, did not show a significant change in error relative to other nodes in the network ($p = 0.052$). However, a
number of outliers do overlap with the RMSE values for the nearest neighbors, indicating that longer-range network
effects are rare but do occur.

At the same time, the Bias was also positively skewed for neighbors of the hidden node (median Bias ≈ 2.1
timesteps), indicating that our intuition from Fig. 1 was correct. Next-nearest neighbors have no discernible bias
(median Bias ≈ 0.03), while other nodes actually have a slightly negative bias (median Bias ≈ -0.18), indicating that
information in the rest of the network actually travels slightly slower than expected due to the hidden node (however,
a zero bias cannot be ruled out for this group).

4  Discussion 

We have shown that the presence of hidden nodes can be inferred by modeling how network topology influences a
dynamical process overlaying that network. We focused on idealized information flow dynamics, but there is great
potential for applying this to other model dynamics. For future work, we intend to use our methodology alongside
real-world data on information cascades and other dynamical processes and to further study how different classes of
network topologies help or hinder the node discovery process. We also plan to better incorporate the directionality of
information flow, which was neglected here by the absolute value used in the equation for $T_{ij}$.

It  is  not  particularly  surprising  that  perturbing  a  network,  which  then  leads  to  perturbed  metrics  such  as  those  used 
in  Eq.  2,  will  lead  to  a  reduction  in  the  accuracy  of  an  SR  model  (e.g.,  Eq.  3).  This  was  shown  in  Fig.  2.  However,  we 
have  shown  (Fig.  3)  that  the  loss  in  accuracy  is  localized  and  correlates  with  the  position  of  the  defect,  indicating  that 
we  are  extracting  useful  information  and  not  merely  randomizing  the  terms  within  the  SR  model’s  functional  form. 

More generally, looking for discrepancies in the speed of information flow (or other quantities) can be used to
study not just missing nodes but other defects and errors, such as missing links or false links that incorrectly appear
in the network, false nodes that do not actually exist, the splitting of a true node into multiple false nodes, or the
merging of multiple true nodes into a single false node. Some of these errors will likely prove more challenging to
detect than others, but the benchmarking procedure we have introduced here may offer some hope towards tackling
these problems.

Figure 3: Identifying the location of a missing node. (A) A scale-free network of 250 nodes with a single node hidden (◇). The
neighbors of the hidden node are indicated with □ while other nodes are ○. The size and color of each node is proportional to
the rms error of the information transfer time from that node to every other node in the network. We see that the neighbors of
the missing node consistently have higher errors than the rest of the network. (B) The distributions of error and bias across the
ensemble of tampered networks for the hidden node's neighbors, next-nearest neighbors, and non-neighbors. The median error
for neighbors is approximately 4.33 timesteps while for non-neighbors it is approximately 1.11 timesteps. The distributions are
significantly different (Mann-Whitney U test $p \ll 10^{-10}$, Cohen's $d = 4.69$). The next-nearest neighbors have errors comparable
to non-neighbors ($p = 0.052$) but we see a greater number of outliers skewing upward. This indicates that there are some network
effects in how errors propagate, but they are relatively rare. Likewise, we see positive bias for neighbor nodes, significantly higher
than for non-neighbors (Mann-Whitney U test $p \ll 10^{-10}$, Cohen's $d = 3.37$). This positive bias indicates that information spreads
faster from (or to) neighbors of the hidden node than the SR model expects, supporting the intuition behind Fig. 1. To control for
the centrality of the hidden node, in each realization the hidden node was the node with the fifth highest degree.

3 Symbolically regressing brain imaging data.

Note:  All  of  the  work  on  neuroimaging  data  conducted  as  part  of  this  award  used  data  collected  by 
a  separate  group — the  IMAGEN  consortium — before  the  award  commenced.  None  of  the  members 
of  our  group  were  involved  in  this  data  collection. 

Brain dynamics are immensely complex across space, time, and physiological subsystems.
Thus, it is extremely unlikely that any assumptions can be made a priori about the structure of
these dynamics. We have therefore adapted symbolic regression for modeling functional MRI
data sets: we induced models that predict behavior as a function of brain dynamics.
In the manuscript that follows (Sect. 3.2) we show that there are many previously unknown nonlinear
relationships between brain regions that can be predictive of behavior.

Furthermore,  we  show  that  such  relationships  can,  at  least  in  the  specific  conditions  investigated 
here,  consistently  predict  behavioral  tendencies.  The  method,  in  brief,  was  a  hybrid  method  that 
employs  symbolic  regression  for  feature  construction  (it  finds  nonlinear  terms  that  are  weakly 
predictive  of  the  outcome  of  interest)  and  more  traditional  regression  methods  for  feature  selection 
(optimizing  coefficients  for  those  terms). 

Many of the models found in this way concord with those discovered using orthogonal methods,
providing some confidence in the other models found by our method. Most notably, symbolic
regression found that differences in the nonlinear relationships between brain regions implicated in
thirst, reward, and emotion accurately predict whether the adolescent participants from whom the
data were drawn regularly consume alcohol or not.

Two additional manuscripts are in preparation. The first is an attempt to predict a different behavioral
tendency (smoking or non-smoking) using the same method. The second is an attempt to
predict how an individual's competence degrades with adversity. Predictions about both behavioral
traits are made directly from models trained against brain imaging data, again collected by groups
not associated with this award.

In future, such methods could be used to rapidly predict other behavioral traits. More importantly,
they may be able to predict behavioral traits that have yet to manifest in the participants, thus
providing an opportunity for proactive treatment and/or counseling.

3.1  Relevance  for  U.S.  defense  and  security. 

Predicting the behavior of warfighters in the field is of extreme interest, especially before they
have been delivered into a theater of war. Similarly, it would be of extreme utility to assess the
mental state of warfighters during deployment without having to distract them with explicit requests
for updates. The method outlined in the manuscript below could be adapted for inferring
current behavior and/or future degradation in behavior directly from neuroimaging. Current MRI
technologies rule out real-time assessment, but advances in mobile MRI, diffusion tensor imaging,
and/or EEG may make such assessment tractable in the near future.

3.2  Allgaier  et  al.  “Nonlinear  functional  mapping...”  (2015). 

A technical manuscript describing the symbolic regression of neuroimaging and behavioral data
sets follows.

Abstract

The field of neuroimaging has truly become data rich, and novel analytical methods capable of gleaning meaningful
information from large stores of imaging data are in high demand. Those methods that might also be applicable
on the level of individual subjects, and thus potentially useful clinically, are of special interest. In the present study,
we introduce just such a method, called nonlinear functional mapping (NFM), and demonstrate its application in the
analysis of resting state fMRI (functional Magnetic Resonance Imaging) from a 242-subject subset of the IMAGEN
project, a European study of adolescents that includes longitudinal phenotypic, behavioral, genetic, and neuroimaging
data. NFM employs a computational technique inspired by biological evolution to discover and mathematically characterize
interactions among ROI (regions of interest), without making linear or univariate assumptions. We show that
statistics of the resulting interaction relationships comport with recent independent work, constituting a preliminary
cross-validation. Furthermore, nonlinear terms are ubiquitous in the models generated by NFM, suggesting that some
of the interactions characterized here are not discoverable by standard linear methods of analysis. We discuss one such
nonlinear interaction in the context of a direct comparison with a procedure involving pairwise correlation, designed to
be an analogous linear version of functional mapping. We find another such interaction that suggests a novel distinction
in brain function between drinking and non-drinking adolescents: a tighter coupling of ROI associated with emotion, reward,
and interoceptive processes such as thirst, among drinkers. Finally, we outline many improvements and extensions
of the methodology to reduce computational expense, complement other analytical tools like graph-theoretic analysis,
and allow for voxel-level NFM to eliminate the necessity of ROI selection.

Keywords: resting state fMRI, modeling, nonlinear, machine learning, genetic programming, symbolic regression

1. Introduction

Many advances in our understanding of brain function
have been achieved through analysis of fMRI data.
Though the BOLD (blood oxygen level dependent) signal
obtained from fMRI is a proxy, physiological confounds
such as breathing and heart rate are separable from
neuronal-induced signal, as demonstrated in Birn et al. (2009).
Intersubject differences in vascular reactivity can be modeled
as shown in Murphy et al. (2011), and BOLD has been
directly shown to provide a reliable measure of neuronal
activity in specific circumstances, as in Mukamel et al.
(2005). The many years of successful research before and
since support that assessment. Accomplishments include
localization of regions responsible for particular tasks, such
as episodic memory in Nolde et al. (1998) and human face
recognition in Kanwisher et al. (1999), assessment of the
risk of postoperative motor defect in patients with tumors
in Mueller et al. (1996), analysis of the effects of acupuncture
in Hui et al. (2000), and recently, identification of
neural markers for both current and future alcohol use
among adolescents in Whelan et al. (2012) and Whelan
et al. (2014).

These examples, and indeed the majority of fMRI studies,
make use of the GLM (general linear model) to determine
neural correlates for various tasks and stimulus responses.
Though typical analyses have been performed at
the group level with a univariate approach, other recent
work reported in Rio et al. (2013) has extended the capabilities
of the GLM to analyze multivariate signal in the
Fourier domain to reduce confounds from time-correlated
noise, thus improving the suitability of the GLM for
subject-level analysis. Despite these advances, however, the
GLM can only confirm hypothesized nonlinear models of
function, not discover them.

Group-level inferences from fMRI have also been performed
using linear ICA (independent component analysis),
as described in Calhoun et al. (2001). Though ICA
and the GLM can be used in conjunction, for example in
Liu et al. (2010) to investigate the neural effects of stimulation
of a particular acupoint, ICA is particularly useful
in circumstances that preclude the use of the GLM, such
as the analysis of resting-state data, for which there is no
task or stimulus regressor. Covarying networks have been
suggested by ICA of resting-state fMRI in Smith et al.
(2009), and functional, hierarchical classification of these
networks has been automated through HCA (hierarchical
cluster analysis) of aggregated experimental metadata in
Laird et al. (2011). However, it was determined early on,
for example in McKeown and Sejnowski (1998), that nonlinear
interactions within the brain need to be addressed
in order to properly determine functional architecture.

Although ICA algorithms that employ nonlinear mixing
functions exist, severe restrictions on those functions
are required to avoid non-uniqueness of solutions, as explained
in Hyvärinen and Pajunen (1999). Due to this
failing, other methodologies have been employed in the
attempt to account for nonlinearity. Examples include
various forms of nonlinear regression, as in Kruggel et al.
(2000), and dynamic causal modelling, as described in Friston
et al. (2003). In each of these, a particular nonlinear
form must be posited a priori, and thus the capability to
discover previously unknown nonlinear interactions within
the brain is diminished. As a result, a fuller picture of the
nature of intra- and inter-network functional connectivity
within the brain is missing from the literature.

Here we introduce a methodology designed to accomplish
such a mathematical characterization, to provide insight
at the group, subject, and ROI levels, and to avoid linear
and univariate assumptions. With some modification,
analysis of higher dimensional data is likely attainable,
allowing for eventual application at the voxel scale and
eliminating the necessity of ROI selection. After standard
preprocessing (slice-timing and motion correction,
normalization, smoothing, etc.), our procedure consists of
ROI selection, inter-ROI symbolic regression (a model-free
form of nonlinear regression), accomplished by an evolutionary
algorithm called genetic programming (GP; a form
of stochastic optimization), and statistical analysis of the
resulting models. We demonstrate our technique on a
242-subject collection of resting-state data from the IMAGEN
project, though analysis of task or stimulus experiments
can be accomplished with little or no modification. The
IMAGEN project is described in detail in Schumann et al.
(2010).

We organize the paper as follows. In Section 2, we
discuss the data and selection of ROI, provide some background
on GP, and describe the procedural details of NFM
by symbolic regression. In Section 3, we report results of
applying the technique to the IMAGEN data, including
statistical and hierarchical visualizations, comparison with
previous results for cross-validation, effects of nonlinearity,
and an example of group-level variation. We discuss the
results and potential applications of the technique in Section 4,
and conclude the paper in Section 5.

2.  Materials  and  methods 

In this section, we first briefly describe the source of
the data for our study, and then provide the details of
ROI selection that allow for comparison with recent work.
We then provide some background on the GP algorithm
in general and the specific implementation employed here,
along with the method by which it is applied to BOLD signal
time series extracted from the selected ROI. Finally,
we describe the statistical technique used to interpret the
roughly quarter of a million mathematical models that result
from the application of GP to all 52 ROI time series
extracted from each of the 242 subjects.

2.1. Data

The data investigated here are a subset of the fMRI
scans from the IMAGEN study, a European research project
with the goal of better understanding teenage psychological
and neurobiological development. The project is longitudinal,
and utilizes several forms of high- and low-tech experimental
protocols including self-report questionnaires,
behavioral assessment, interviews, neuroimaging, and blood
sampling for genetic analyses. Each of the 2000 participating
adolescents was 14 when entering the study, which itself
commenced in late 2007, and data collection continues
today.

More specifically, the data for the present study are
6-minute resting-state fMRI time series of 242 of the adolescent
subjects who were asked to keep their eyes open while
in the scanner, but were presented with no other task or
stimulus. To allow for comparison with previous work, locations
of the ROI were chosen based on results from Laird
et al. (2011), in which statistical analysis across thousands
of previous imaging studies (both stimulus/task-based and
resting-state) was used to identify networks of brain regions
that tend to activate together, termed ICN (intrinsic
connectivity networks). The ICN were determined by
ICA, from which z-statistic maps were derived. To select
ROI for this study, a z-statistic threshold was set for each
ICN to determine the number of regions in the network,
and ROI were defined as rough spheres with radii of 3 voxels
(9 mm) and centered at the location of peak z-statistic
in each region.


Figure 1: ROI Selection. (Red) ROI from within the default mode
network, with radii of 3 voxels and centers corresponding to the
highest z-statistics (green) in each region as determined in Laird
et al. (2011).

We provide a cut-out illustrating ROI selection for the
default mode network (ICN 13) in Figure 1, and Figure
2 contains axial cross sections showing many of the ROI
derived from the 18 non-artifactual ICN in Laird et al.
(2011). In Appendix A, Table A.1 we list all 52 ROI by
number, give their anatomical names, indicate the ICN
from within which they were defined, and provide visual
representations of their locations within the brain.

Subsequent  to  ROI  definition,  a  gray  matter  mask  was 
applied to ensure that only appropriate voxels were contained
within each ROI. In some cases this resulted in
a  considerable  reduction  of  ROI  voxels,  but  the  majority 
maintained  the  full  complement  of  about  100  voxels.  For 
each  of  the  242  subjects,  time  series  were  extracted  from 
each  of  the  52  resulting  ROI  by  averaging  the  BOLD  sig¬ 
nal  over  all  voxels  within  the  ROI.  These  time  series  then 
form  the  input  to  the  GP  algorithm. 
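
The averaging step described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used in the study; the function name and the toy voxel data are hypothetical.

```python
def roi_time_series(voxel_series):
    """Average the BOLD signal over all voxels in one ROI, scan by scan.

    voxel_series: list of per-voxel time series (hypothetical layout),
    all of the same length.
    """
    n_voxels = len(voxel_series)
    # zip(*...) regroups the data so that each tuple holds one scan's
    # values across all voxels in the ROI.
    return [sum(scan) / n_voxels for scan in zip(*voxel_series)]


# Toy ROI with two voxels and three scans (made-up numbers).
toy_roi = [[1.0, 2.0, 3.0],
           [3.0, 4.0, 5.0]]
mean_signal = roi_time_series(toy_roi)
```

Applied to each of the 52 ROI of a subject, this yields the 52 time series that serve as input to the GP search.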


Figure  2:  Visualization  of  ROI.  Axial  cross  sections  showing  many 
of  the  ROI  derived  from  the  ICN  in  Laird  et  al.  (2011). 

2.2.  Genetic  programming 

GP  is  a  biologically  inspired,  population-based  ma¬ 
chine  learning  algorithm.  It  is  most  commonly  employed 
for  symbolic  regression:  the  algorithm  searches  for  models 
explaining  some  quantity  of  interest  (e.g.,  average  BOLD 
signal  from  an  ROI  in  the  brain)  as  a  function  of  some 
other  possibly  related  observable  quantities,  statistics,  or 
summary  data  (e.g.,  BOLD  signals  from  other  ROI).  The 
algorithm  proceeds  by  evolving  the  functional  forms  of  a 
population  of  potential  models,  which  are  initially  con¬ 
structed  at  random  from  user-specified  mathematical  build¬ 
ing  blocks  (available  variables,  arithmetic  functions,  pa¬ 
rameter  constants,  etc.).  In  brief,  the  models  that  better 
explain  the  data  produce  more  offspring,  leading  to  a  grad¬ 
ual  reduction  of  error  within  the  population.  We  show  a 
representative  set  of  models  produced  by  this  approach  in 
Figure 3(a). A key advantage of the technique is that no
assumption (e.g., linearity) is imposed on the form of solutions,
other than the choice of building blocks from which
they can be made (we use arithmetic operations in the
present work).

[Figure 3 panel titles: Best Solutions of Different Sizes; Solution Details (calculated on validation data)]
Figure 3: Screen shot of the GP package Eureqa during a search for models of the activity in ROI 19 in a single subject, as a function of
activity in the other 51 regions. (a) The current set of models along the Pareto front of accuracy vs. parsimony, shown in (d) where each
point represents a model and the red point represents the highlighted model. (b) Data from ROI 19 (points) over the 6-minute time series
for this subject (x-axis in scans, 2 seconds each). The highlighted model is shown in red, and statistics for this model's fit appear in (c).

Typically,  some  measure  of  error  (e.g.,  root  mean  square 
error)  constitutes  a  model’s  explanatory  fitness,  and  some 
measure  of  its  size  (e.g.,  number  of  operators,  constants, 
and  variables  in  the  equation)  represents  its  parsimony. 
The  next  generation  of  potential  models  is  obtained  by 
mutation  (e.g.,  a  single  change  of  variable  or  operator) 
and  recombination  (i.e.,  swapping  of  function  components 
between  models)  of  the  current  set  of  non-dominated  solu¬ 
tions:  those  models  for  which  no  simpler  model  in  the  pop¬ 
ulation  has  less  error.  This  set  of  non-dominated  models 
is  said  to  approach  the  ideal  Pareto  front  of  fitness  versus 
parsimony  as  the  population  evolves.  An  important  aspect 
of  GP  is  that  the  result  of  a  single  search  is  this  entire  set 
of  potential  models,  providing  a  trove  of  information  for 
statistical  analysis.  Figure  3  is  a  screenshot  of  the  off-the- 
shelf  GP  package  Eureqa  from  Schmidt  and  Lipson  (2009) 
performing a search (Eureqa version 0.97 Beta was used
to generate the results reported in this study).
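
To make the notion of a non-dominated set concrete, the following sketch filters a toy population down to its Pareto front of error versus size. It is not Eureqa's implementation, which is not described here; the `Model` type, the expressions, and the error values are hypothetical, and ties between models of equal size are resolved by keeping only the lowest-error one.

```python
from typing import NamedTuple


class Model(NamedTuple):
    expression: str   # human-readable form of the equation
    size: int         # parsimony: number of operators, constants, variables
    error: float      # fitness: e.g. root mean square error


def non_dominated(models):
    """Return the models that have strictly lower error than every
    model of smaller or equal size that precedes them in sorted order."""
    front = []
    best_error = float("inf")
    # Sort by size, then error, so the best model of each size comes first.
    for m in sorted(models, key=lambda m: (m.size, m.error)):
        if m.error < best_error:
            front.append(m)
            best_error = m.error
    return front


# Hypothetical population of candidate models for one ROI.
population = [
    Model("x9", 1, 0.50),
    Model("x9 + x20", 3, 0.30),
    Model("x9 * x20", 3, 0.35),
    Model("x9 + x20 - x7", 5, 0.32),
    Model("x9 + x20 * x29", 5, 0.20),
]
front = non_dominated(population)
```

In this toy population the front retains one model per size, and each retained model improves on the error of every simpler one.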

To apply GP to the fMRI data, for each of the 242 subjects
we extract a single BOLD signal time series from each
of the 52 selected ROI by averaging over the voxels within
that ROI. Then the GP algorithm is run 52 times, once
for each ROI, using all other ROI as potential explanatory
variables.  Note  that  the  algorithm  has  no  knowledge  of 
the  hypothesized  networks  from  which  these  regions  were 
chosen. 

We  describe  the  computational  expense  of  the  algo¬ 
rithm  in  terms  of  core-hours,  i.e.,  the  number  of  hours  re¬ 
quired  for  a  single  processor  core  to  perform  the  necessary 
computation.  Specifically,  twelve  core-hours  of  search  were 
performed  for  each  region,  amounting  to  624  core-hours 
per  subject,  and  over  17  total  core-years  of  computation 
were  required  for  the  population  of  242  subjects.  This 
yielded roughly 12 thousand Pareto fronts comprising about a
quarter million models for statistical analysis.
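
The figures quoted above follow from simple arithmetic, reproduced here as a check. The constant of roughly 20 models per Pareto front is an approximation taken from the counts reported elsewhere in this paper, and a year is taken as 365 days.

```python
CORE_HOURS_PER_ROI = 12
N_ROI = 52
N_SUBJECTS = 242
HOURS_PER_YEAR = 24 * 365
MODELS_PER_FRONT = 20  # approximate, from the per-subject counts in the text

core_hours_per_subject = CORE_HOURS_PER_ROI * N_ROI       # 624
total_core_hours = core_hours_per_subject * N_SUBJECTS     # 151008
total_core_years = total_core_hours / HOURS_PER_YEAR       # a little over 17
n_pareto_fronts = N_ROI * N_SUBJECTS                       # 12584
approx_n_models = n_pareto_fronts * MODELS_PER_FRONT       # about 250 thousand
```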

The  results  of  this  analysis  characterize  the  entire  pop¬ 
ulation  of  242  subjects.  Alternatively,  results  can  be  ag¬ 
gregated  over  phenotypic  groups  to  produce  group-level 
characterizations,  or  many  GP  searches  can  be  run  for 
a single individual to produce a subject-level characterization.
We report results of population- and group-level
analyses in Section 3, and discuss an example subject-level
analysis in Appendix B.

Figure 4: Functional interaction map. (a) Interaction map across all 242 subjects, and (b) map of RSD (relative standard deviation) of the
interaction rates over 100 subsamples with 100 randomly selected subjects each. Solid outlines indicate ICN and dashed outlines indicate
functional groupings of ICN from Laird et al. (2011).

2.3.  Analysis 

The  output  of  the  GP  algorithm  poses  a  challenge  for 
interpretation.  Here  we  present  a  coarse  statistical  anal¬ 
ysis  of  this  rich  mathematical  characterization.  For  each 
ROI,  we  count  the  number  of  models  for  that  ROI,  across 
the  Pareto  fronts  for  all  242  subjects,  that  have  a  partic¬ 
ular  (other)  region  on  the  right-hand  side  of  the  equation. 
We  compute  this  count  for  each  of  the  other  51  regions. 

For  example,  consider  the  GP  search  for  models  of  ROI 
19  within  a  single  subject  illustrated  in  Figure  3.  Upon 
completion,  all  20  of  the  models  along  the  Pareto  front 
for  this  subject  had  at  least  one  term  containing  ROI  9, 
and  17  models  had  terms  containing  ROI  20.  In  the  sub¬ 
ject  pool  as  a  whole,  the  total  counts  are  2990  and  1984, 
respectively.  Specifically,  of  the  roughly  5000  models  for 
ROI  19  across  all  subjects,  about  60%  have  terms  contain¬ 
ing  ROI  9,  and  about  40%  have  terms  containing  ROI  20. 
Note  that  these  frequencies  are  not  properly  normalized, 
because  most  models  contain  several  ROI.  Thus  we  nor¬ 
malize  by  the  sum  of  the  counts  for  all  ROI.  In  the  case  of 
ROI  19,  this  sum  is  22016. 

The  result  is  a  vector  for  each  ROI  that  describes, 
in  a  statistical  sense,  its  relative  dependence  on  each  of 
the  other  regions.  We  interpret  this  vector  as  a  distri¬ 
bution  of  likely  interaction,  and  define  the  computed  val¬ 
ues  to  be  relative  interaction  rates  (IR).  Note  that  both 
linear  and  nonlinear  interactions,  as  well  as  weakly  and 
strongly  weighted  basis  functions,  are  counted  equally.  We 
form an interaction map by stacking these IR row vectors
to visualize interaction across all 52 ROI, shown in Figure
4(a). The value in row 19, column 9, for example, is
2990/22016 ≈ 0.136, depicted as a yellow square.
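
The counting and normalization procedure can be sketched as below. As a simplifying assumption, each model is represented only by the set of ROI indices on its right-hand side; the function name and the toy fronts are hypothetical and are not taken from the study's data.

```python
from collections import Counter


def interaction_rates(fronts_by_subject, target):
    """Relative interaction rates (IR) for one target ROI.

    fronts_by_subject: one Pareto front per subject; each model is given
    as a collection of ROI indices appearing on its right-hand side.
    """
    counts = Counter()
    for front in fronts_by_subject:
        for rhs_rois in front:
            # A model contributes one count per ROI it contains, whether
            # the ROI enters linearly or nonlinearly, once or many times.
            counts.update(set(rhs_rois) - {target})
    total = sum(counts.values())
    return {roi: n / total for roi, n in counts.items()} if total else {}


# Toy data: two subjects, models of ROI 19 (made-up fronts).
toy_fronts = [
    [{9}, {9, 20}, {9, 20, 29}],
    [{9}, {20}],
]
toy_ir = interaction_rates(toy_fronts, target=19)

# The value reported in the text for row 19, column 9.
reported_ir = 2990 / 22016
```

Because each row is divided by its own total count, the entries of every row sum to one, which is what allows a row to be read as a distribution.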

Note that the IR map is not symmetric by construction
(though it appears nearly so), and indeed the value
in row 9, column 19 is 0.148 ≠ 0.136. We interpret a row
of  the  IR  map  as  a  distribution  of  relative  dependence  of 
the  corresponding  ROI  on  each  of  the  other  regions.  We 
interpret  a  column,  on  the  other  hand,  as  a  measure  of  the 
influence  of  the  corresponding  ROI  on  each  of  the  other 
regions.  By  averaging  the  IR  map  with  its  transpose,  we 
produce  a  symmetric,  overall  IR  map  (not  shown)  that  can 
be  used  in  hierarchical  analysis.  We  examine  the  interac¬ 
tion  map,  and  provide  results  of  hierarchical  analysis,  in 
the  next  section. 
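
A minimal sketch of the symmetrization step, using the two values quoted above for ROI 9 and ROI 19 in an otherwise hypothetical two-region map:

```python
def symmetrize(ir_map):
    """Average a square IR map with its transpose."""
    n = len(ir_map)
    return [[(ir_map[i][j] + ir_map[j][i]) / 2 for j in range(n)]
            for i in range(n)]


# Index 0 stands for ROI 19 and index 1 for ROI 9 (toy 2x2 map).
toy_map = [[0.0, 0.136],
           [0.148, 0.0]]
overall = symmetrize(toy_map)
```

The overall IR for this pair is the mean of 0.136 and 0.148, i.e. 0.142, and it is the same in both directions.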

3.  Results 

Figure  4(a)  shows  the  interaction  map  generated  by 
the  normalized  frequency  analysis  of  the  NFM  procedure, 
summarizing  ROI  interaction  across  all  242  subjects.  To 
test  the  robustness  of  the  computed  interaction  map,  we 
form  100  random  subsamples  (with  replacement)  from  the 
pool  of  242  subjects,  each  with  100  subjects.  For  each 
sample,  we  perform  the  same  counting  procedure  to  pro¬ 
duce  the  interaction  map  corresponding  to  that  sample.  A 
heat  map  of  relative  standard  deviation  (RSD)  of  IR  over 
the  100  subsamples  is  shown  in  Figure  4(b). 
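
The robustness test can be sketched for a single entry of the interaction map as follows. The per-subject counts are synthetic, the function name is hypothetical, and we assume here that subjects are drawn with replacement within each subsample; the text does not spell out this detail.

```python
import random
import statistics


def subsample_rsd(pair_counts, total_counts,
                  n_subsamples=100, subsample_size=100, seed=0):
    """RSD of one interaction rate over random subsamples of subjects.

    pair_counts[i]: models of the target ROI in subject i that contain
    the other ROI; total_counts[i]: sum of counts over all ROI.
    """
    rng = random.Random(seed)
    indices = range(len(pair_counts))
    rates = []
    for _ in range(n_subsamples):
        chosen = rng.choices(indices, k=subsample_size)
        # Repeat the counting procedure on the subsample only.
        pair = sum(pair_counts[i] for i in chosen)
        total = sum(total_counts[i] for i in chosen)
        rates.append(pair / total)
    return statistics.stdev(rates) / statistics.fmean(rates)


# Synthetic counts for 242 subjects (not data from the study).
toy_pair = [10 + (i % 7) for i in range(242)]
toy_total = [90] * 242
toy_rsd = subsample_rsd(toy_pair, toy_total)
```

A low RSD indicates that the interaction rate does not depend strongly on which subjects happen to be included.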

The  strong  block-diagonal  structure  of  the  interaction 
map  corresponds  directly  to  the  grouping  of  ROI  into  ICN. 
For example, regions 39-42, which form a partial block
in the figure, are the four ROI that make up the default
mode network (ICN 13) in Laird et al. (2011). Robustness
(across  subjects)  of  intra-network  interaction  is  supported 
by  the  matching  block-diagonal  structure  of  low  subsam¬ 
pling  RSD  (mean  intra-network  RSD  <  20%),  for  all  but 
ICN  2  (ROI  3-4)  and  ICN  5  (ROI  10-18).  In  addition 
to  the  strong  primary  block-diagonal  structure,  there  is  a 
secondary  structure  of  lighter  blocks  that  group  ICN  to¬ 
gether.  For  example,  regions  32-38  are  composed  of  the 
strong  blocks  32-33,  34-35,  and  36-38  (corresponding  to 
ICN  10,  11  and  12  respectively).  There  is  a  lighter  block 
structure  that  suggests  interaction  among  these  three  ICN 
which are, in fact, together responsible for visual processing.
The secondary structure visible for regions 19-31
comprises ICN 6-8, which perform motor and visuospatial
tasks. Each of these examples shows a matching
secondary structure of moderate subsampling RSD (mean
inter-network RSD < 30%), indicating fairly robust inter-network
interaction as well.

3.1.  Hierarchical  analysis 

To  further  illustrate  and  clarify  the  hierarchical  organi¬ 
zation  suggested  by  the  interaction  map,  we  generate  the 
dendrogram  in  the  top  of  Figure  5  by  HCA  (hierarchical 
cluster  analysis,  implemented  in  MATLAB  with  the  near¬ 
est  distance  algorithm),  using  the  reciprocal  of  the  over¬ 
all  IR  between  each  pair  of  ROI  as  the  distance  between 
them.  For  example,  ROI  1  and  2  have  an  approximate 
overall  IR  of  0.2,  and  thus  the  distance  between  them  is  5. 
We  emphasize  that  the  organization  of  ROI  into  networks, 
and  clustering  of  those  networks  into  functional  groups  de¬ 
scribed  in  Laird  et  al.  (2011),  are  both  captured  by  NFM. 
Some  examples: 

•  The  red  group  forms  the  visual  cluster.  ROI  32  and 
33,  the  lateral  occipital  cortices,  form  one  network 
(ICN  10),  while  ROI  34-35,  the  occipital  poles,  and 
ROI  36-38,  the  lingual  gyrus,  right  cuneus  and  right 
fusiform  gyrus,  respectively,  form  two  other  networks 
(ICN  11  and  12)  from  within  the  visual  cluster. 

•  Regions  39-42  (the  orange  group)  form  ICN  13,  the 
default  mode  network,  and  interact  with  ROI  4,  the 
ventromedial  prefrontal  cortex,  from  ICN  2. 

•  The  green  group  to  the  far  left  includes  all  but  one  of 
the  ROI  from  the  motor  and  visuospatial  complex. 
Interaction  of  this  complex  with  the  middle  cingulate 
cortex  (mCC,  ROI  9)  and  the  network  composed  of 
ROI  46  and  47,  thought  to  be  responsible  for  multiple 
cognitive  processes  such  as  attention  and  inhibition, 
is  indicated  as  well,  suggesting  that  this  interaction 
was  common  among  many  of  the  subjects. 

•  ICN  1  (ROI  1,2),  3  (ROI  5,6),  the  first  two  regions 
from  ICN  4  (ROI  7-9),  ICN  14  (ROI  43-45),  16  (ROI 
48,49),  and  17  (ROI  50,51)  are  also  indicated. 


•  Many  of  the  regions  from  ICN  5  (ROI  10-18)  interact 
with  ICN  1  (ROI  1-2),  and  also  form  a  loose  interac¬ 
tion  group  with  ICN  14  (ROI  43-45),  the  cerebellum, 
the  most  robust  connection  of  which  appears  to  be 
between  ROI  16  and  43. 

The  robustness  of  each  of  the  interactions  discussed  in  this 
list  is  supported  by  low  interaction  rate  subsampling  RSD, 
shown  in  Figure  4(b). 
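
The clustering step described at the start of this subsection can be sketched in a few lines. The hierarchy in Figure 5 was produced with MATLAB; the naive pure-Python agglomeration below is only an illustration of nearest-distance (single-linkage) clustering on reciprocal interaction rates, and the four-region IR matrix is hypothetical.

```python
def ir_to_distance(overall_ir):
    """Distance between two ROI is the reciprocal of their overall IR."""
    n = len(overall_ir)
    return [[0.0 if i == j
             else (1.0 / overall_ir[i][j] if overall_ir[i][j] > 0
                   else float("inf"))
             for j in range(n)] for i in range(n)]


def single_linkage(distance):
    """Naive single-linkage clustering; returns the merge history as
    (cluster_a, cluster_b, merge_distance) tuples."""
    clusters = [frozenset([i]) for i in range(len(distance))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cluster distance is that of the closest pair of members.
                d = min(distance[i][j]
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (a, b)] + [merged]
    return merges


# Toy symmetric IR map: regions 0-1 and 2-3 interact strongly.
toy_ir = [[0.00, 0.20, 0.05, 0.05],
          [0.20, 0.00, 0.05, 0.05],
          [0.05, 0.05, 0.00, 0.25],
          [0.05, 0.05, 0.25, 0.00]]
merges = single_linkage(ir_to_distance(toy_ir))
```

As in the ROI 1 and 2 example, an overall IR of 0.2 corresponds to a distance of 5, so the pair with IR 0.25 (distance 4) merges first.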

3.2.  Impact  of  nonlinearity 

In  this  section  we  demonstrate  that  the  NFM  proce¬ 
dure  both  captures  the  hierarchical  structure  of  ROI  in¬ 
teraction  indicated  by  linear  analyses,  and  reveals  nonlin¬ 
ear  interactions  not  discoverable  by  such  methods.  To  ac¬ 
complish  this,  we  compare  the  population-level  hierarchy 
generated  by  NFM  with  the  results  of  an  analogous  lin¬ 
ear  procedure  involving  pairwise  correlation  analysis.  Fur¬ 
thermore,  we  validate  nonlinear  relationships  suggested  by 
NFM  in  a  stepwise  multiple  regression,  and  an  elastic  net 
regularized  regression,  the  results  of  which  we  describe  at 
the  end  of  this  section. 

3.2.1.  Comparison  with  correlation  analysis 

For  each  of  the  242  subjects,  we  compute  the  correla¬ 
tion  matrix  for  the  52  ROI  time  series.  Squaring  the  ele¬ 
ments  of  the  correlation  matrix  and  normalizing  each  row 
(after  setting  the  diagonal  to  zero)  provides  the  relative 
explained variance (relative R^2) of the ROI corresponding
to  that  row  by  each  of  the  other  51  ROI.  The  average  of 
the  242  normalized  subject  matrices  is  interpreted  as  the 
linear  version  of  the  population-level  IR  map  generated 
by  NFM.  As  with  IR,  the  reciprocal  of  relative  explained 
variance  can  be  considered  a  distance  between  ROI  (higher 
relative R^2 means closer). The resulting hierarchy generated
by HCA is shown in the bottom of Figure 5.
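
For a single subject, the relative R^2 matrix can be computed as in the sketch below. The helper names and the three short time series are hypothetical; real input would be the 52 ROI time series of one subject.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def relative_r2(series):
    """Square the correlations, zero the diagonal, normalize each row."""
    n = len(series)
    r2 = [[0.0 if i == j else pearson(series[i], series[j]) ** 2
           for j in range(n)] for i in range(n)]
    return [[value / sum(row) for value in row] for row in r2]


# Three made-up ROI time series of five scans each.
toy_series = [[1.0, 2.0, 3.0, 4.0, 5.0],
              [2.0, 4.0, 6.0, 8.0, 11.0],
              [5.0, 3.0, 4.0, 1.0, 2.0]]
toy_rel_r2 = relative_r2(toy_series)
```

Averaging these matrices over subjects gives the linear counterpart of the IR map; like the IR map, the result is row-normalized and therefore not symmetric in general.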

As  expected,  much  of  the  large-scale  structure  revealed 
by  NFM  is  also  indicated  by  the  linear  correlation  analy¬ 
sis.  The  similarity  of  the  generated  hierarchies  supports 
the validity of the models discovered by GP (i.e., the algorithm
is not excessively overfitting the data), and the
subtle  differences  between  them  suggest  potentially  inter¬ 
esting  interactions  that  are  missed  if  linearity  is  assumed. 
In  the  following  we  investigate  one  of  these  differences. 

Interaction  of  the  mCC  (ROI  9)  with  the  motor  visu¬ 
ospatial  complex  is  evident  in  both  hierarchies.  However, 
in  the  linear  analysis  it  appears  more  closely  connected 
with  its  own  ICN  (ROI  7,8,  the  bilateral  anterior  insula), 
and  only  with  the  posterior  dorsomedial  prefrontal  cortex 
(dmPFC,  ROI  19)  from  ICN  6.  In  contrast,  NFM  reveals 
that  activity  in  the  mCC  is  related  to  more  components 
of  the  motor  system.  The  nonlinear  models  generated  by 
GP  show  a  strong  connection  between  the  mCC,  posterior 
dmPFC,  and  the  paracentral  lobule  (PL)  of  the  primary 
motor  cortex  (ROI  29  from  ICN  8)  shown  in  red,  green, 
and blue, respectively, in Figure 6. Specifically, about 20%
of all of the models generated for the activity in the posterior
dmPFC, across all subjects and levels of complexity,
contain both the mCC and the PL as explanatory variables.

Figure 5: Hierarchical cluster analysis (HCA) of interaction among ROI, generated with NFM (top) and correlation analysis (bottom).

For  many  of  these  models,  the  mCC  and  PL  only  show 
up  as  linear  terms,  so  it  is  reasonable  to  wonder  why  the 
correlation  analysis  did  not  pick  up  this  interaction.  The 
vast  majority  of  models  containing  mCC  and  PL  in  only 
linear terms also contain nonlinear terms in other ROI. It
is the posterior dmPFC along with these nonlinear terms
that is correlated with the mCC and PL. Thus the interaction
is hidden from linear analyses. Furthermore, many of
the  models  do  contain  nonlinear  terms  involving  the  mCC 
and  PL.  In  fact,  the  product  of  the  activity  in  these  two 
regions  shows  up  in  78  models  for  the  posterior  dmPFC 
across  21  different  subjects,  and  it  is  always  additive.  This 
term  is  involved  in  models  across  the  spectrum  of  complex¬ 
ity,  including  instances  where  it  is  the  only  term. 


Figure  6:  NFM  reveals  a  nonlinear  interaction  among  these  three 
ROI:  mCC  (red),  posterior  dmPFC  (green),  and  PL  in  the  primary 
motor  cortex  (blue). 

3.2.2. Validation of nonlinear terms

To  validate  first  order  nonlinearity  (pairwise  product 
and  quotient  terms,  as  well  as  reciprocals)  suggested  by 
NFM,  we  first  randomly  assign  100  subjects  to  a  train¬ 
ing  group,  and  100  different  subjects  to  a  testing  group. 
NFM  results  are  aggregated  over  the  training  group  to  pro¬ 
duce  an  IR  map  and  hierarchy  (not  shown)  summarizing 
ROI  interaction  within  the  training  group  as  a  whole.  The 
roughly  2000  specific  models  generated  by  NFM  for  each 
region  (approximately  20  models  per  training  subject)  are 
then  used  to  inform  the  modeling  of  ROI  activity  in  that 
region  within  the  testing  group,  by  stepwise  regression. 
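
The random assignment of subjects to disjoint training and testing groups can be sketched as follows. The function name, the integer subject identifiers, and the fixed seed are illustrative assumptions.

```python
import random


def split_train_test(subject_ids, group_size=100, seed=0):
    """Draw two disjoint random groups of subjects."""
    rng = random.Random(seed)
    # Sampling without replacement guarantees the groups do not overlap.
    chosen = rng.sample(list(subject_ids), 2 * group_size)
    return chosen[:group_size], chosen[group_size:]


train_ids, test_ids = split_train_test(range(242))
```

With 242 subjects and two groups of 100, 42 subjects belong to neither group.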

For each ROI and each testing subject, we first perform
a standard stepwise linear regression using the other
51 ROI as regressors. We then perform a stepwise regression
including all first order nonlinear terms suggested by
NFM  over  the  training  group  in  addition  to  the  51  linear 
regressors.  Statistics  of  the  linear  and  nonlinear  models 
are  compared  to  determine  the  effect  of  including  these 
first  order  terms.  To  illustrate,  we  describe  results  of  the 
validation  procedure  for  the  posterior  dmPFC  (ROI  19) 
here. 

We  show  a  histogram  of  increase  in  the  percentage  of 
explained  variance  for  the  nonlinear  versus  linear  regres¬ 
sion  models  for  the  posterior  dmPFC  in  Figure  7.  The 
inclusion  of  first  order  nonlinear  terms  suggested  by  NFM 
over  the  training  group  increases  the  percentage  of  ex¬ 
plained  variance  for  every  test  subject,  with  a  mean  in¬ 
crease  of  12.5%  and  maximum  increase  of  42%.  The  non¬ 
linear  models  contain  more  terms  (mean  46,  compared 
with  mean  19  for  linear  models),  so  a  potential  concern 
is that the increase in R^2 might simply be a result of the
additional degrees of freedom. However, for each test subject
the nonlinear model F-statistic is also greater than
that of the linear model (mean increase of 83.5, maximum
increase of 1000), and comparisons of adjusted R^2, which
accounts for differences in degrees of freedom, show only
slightly  smaller  increases  for  all  test  subjects.  This  sug¬ 
gests  that  the  increase  in  explained  variance  is  due  to  ex¬ 
planatory  power  of  the  nonlinear  terms,  and  not  simply 
the  additional  degrees  of  freedom  in  the  nonlinear  models. 
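
The role of the adjusted R^2 in this argument can be illustrated with the standard formula. The R^2 values below are hypothetical and chosen only to show the mechanics; the number of observations (180) assumes one scan every 2 seconds over the 6-minute series, and the term counts are the reported means.

```python
def adjusted_r2(r2, n_obs, n_terms):
    """Standard adjusted R^2 for a model with n_terms regressors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_terms - 1)


N_SCANS = 180  # assumed: 6 minutes at 2 seconds per scan

# Hypothetical R^2 values for one test subject (not from the study).
linear_r2, nonlinear_r2 = 0.80, 0.925
linear_adj = adjusted_r2(linear_r2, N_SCANS, n_terms=19)
nonlinear_adj = adjusted_r2(nonlinear_r2, N_SCANS, n_terms=46)

raw_gain = nonlinear_r2 - linear_r2
adjusted_gain = nonlinear_adj - linear_adj
```

The penalty grows with the number of terms, so the adjusted gain is smaller than the raw gain; a gain that survives the adjustment cannot be attributed to the extra degrees of freedom alone.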


Figure  7:  Histogram  of  increase  in  explained  variance.  The  inclusion 
of  first  order  nonlinear  terms  suggested  by  NFM  over  the  training 
group,  in  a  stepwise  regression  analysis  for  the  posterior  dmPFC  in 
the  testing  group,  increases  the  percentage  of  explained  variance  for 
every  test  subject,  with  a  mean  increase  of  12.5%  and  maximum 
increase  of  42%. 

To  further  support  the  validity  of  the  nonlinear  terms 
suggested  by  NFM,  we  apply  a  similar  testing  approach 
using  a  machine  learning  algorithm  called  elastic  net  regu¬ 
larized  regression.  In  contrast  to  stepwise  regression,  reg¬ 
ularization  allows  for  the  inclusion  of  highly  correlated  ex¬ 
planatory  variables,  while  simultaneously  discounting  re¬ 
gressors  with  very  small  coefficients.  Regularized  mod¬ 
els  can  have  more  explanatory  power  or  fewer  terms  (or 
both), with respect to those from stepwise regression. We
see each of these scenarios in the present case. Using the
same  training  and  testing  groups,  and  modeling  the  same 
ROI  (the  posterior  dmPFC),  elastic  net  regularization  pro¬ 
duces  linear  models  with  an  average  of  45  terms  that  ex¬ 
plain  roughly  the  same  amount  of  variance  (on  average)  as 
the  nonlinear  models  generated  with  stepwise  regression, 
improving  upon  the  explanatory  power  of  their  stepwise 
counterparts.  However,  the  regularized  nonlinear  models 
provide  that  same  explanatory  power  with  a  mean  of  only 
26  terms.  Furthermore,  the  regularized  nonlinear  model  is 
preferable  to  the  regularized  linear  model,  as  determined 
by  the  Akaike  information  criterion,  for  every  single  test 
subject. 
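
The model comparison can be illustrated with the least-squares form of the Akaike information criterion, AIC = n ln(RSS/n) + 2k, which holds up to an additive constant under Gaussian errors. Which variant of the criterion was used in the study is not stated, so this form is an assumption, and the residual sums of squares below are hypothetical.

```python
import math


def aic_least_squares(rss, n_obs, n_params):
    """AIC for a least-squares fit, up to an additive constant."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params


N_SCANS = 180  # assumed: 6 minutes at 2 seconds per scan

# Two models with the same (made-up) residual error but different sizes,
# using the mean term counts reported for the regularized models.
aic_linear = aic_least_squares(rss=50.0, n_obs=N_SCANS, n_params=45)
aic_nonlinear = aic_least_squares(rss=50.0, n_obs=N_SCANS, n_params=26)
```

At equal explanatory power the criterion prefers the model with fewer terms (lower AIC is better), which is the situation described for the regularized nonlinear models.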

3.3.  Group-level  variation 

Variation  among  individuals  (illustrated  in  Appendix 
B)  suggests  that  statistics  of  interaction  rates  among  ROI 
may  differ  between  phenotypic  groups.  The  hierarchical 
organization  of  ROI  induced  by  IR  might  illuminate,  in 
such  cases,  variation  in  functional  dynamics  associated 
with  demographic,  behavioral,  or  genetic  characteristics. 
An  example  illustrating  this  potential  is  provided  by  the 
contrast  between  drinking  (D)  and  non-drinking  (ND)  ado¬ 
lescents  from  the  IMAGEN  dataset.  In  Figure  8,  we  show 
hierarchies  for  the  top  and  bottom  100  subjects  in  terms 
of  lifetime  drinking  score,  determined  by  self-report  ques¬ 
tionnaire,  corresponding  to  those  who  have  had  2  or  more 
lifetime  drinks,  and  those  who  have  had  1  or  fewer,  respec¬ 
tively.  The  two  hierarchies  are  similar  to  one  another  (and 
comparable  to  the  population  level  hierarchy),  but  sub¬ 
tle  differences  between  them  suggest  group-differentiating 
factors. 

•  The  ROI  pair  3,18,  the  subgenual  anterior  cingulate 
cortex (ACC) and fornix body, respectively, are coupled
in both the D and ND groups. However, their
arrangement in the hierarchies is different, as described
in the final item of this list, resulting from the following
two distinguishing interaction rates.

•  For  the  ND  group,  there  is  a  22%  lower  IR  between 
ROI  6,  the  left  globus  pallidus,  and  the  fornix  body, 
ROI 18. We note that this reduced interaction is
completely missed by pairwise correlation analysis
(which indicates a slightly reduced interaction among
drinkers; see Appendix C), and thus appears to be
an entirely nonlinear effect.

•  In  contrast,  there  is  a  33%  higher  intra-network  IR 
within  ICN  2,  comprised  of  ROI  3-4,  the  subgenual 
ACC  and  the  ventromedial  prefrontal  cortex  (vmPFC), 
respectively,  among  non-drinkers.  Though  this  dif¬ 
ference  is  also  indicated  by  correlation  analysis,  only 
about  half  of  the  effect  is  captured  (a  16%  elevation). 

• These two differences in interaction cooperate to shuffle
the hierarchical arrangement of ROI in the D
versus ND group. The subgenual ACC and fornix
body are most closely associated with the default
mode  network  in  non-drinkers,  through  the  vmPFC. 
Among  drinkers,  in  contrast,  they  are  grouped  di¬ 
rectly  with  the  bilateral  globus  pallidus  of  ICN  3 
(ROI  5-6).  In  other  words,  in  drinkers  there  is  a 
tighter  coupling  among  ROI  most  strongly  linked  to 
reward  and  thirst  tasks  as  reported  in  Laird  et  al. 
(2011).  The  relevant  ROI  are  shown  in  Figure  9. 


Figure  9:  Interaction  between  the  subgenual  ACC  (top  red)  and 
vmPFC  (bottom  red)  is  lower  among  drinkers,  who  also  show  ele¬ 
vated  interaction  between  the  left  globus  pallidus  (green)  and  fornix 
body  (blue),  an  apparently  nonlinear  effect. 

•  The  largest  single  difference  between  the  D  and  ND 
groups  is  a  74%  elevated  IR  between  the  right  an¬ 
gular  gyrus  (ROI  41  in  the  default  mode  network) 
and  ROI  11,  the  posterior  cingulate  cortex,  among 
drinkers.  These  ROI  are  shown  in  red  and  green,  re¬ 
spectively,  in  Figure  10.  About  half  of  this  effect  is 
captured  by  correlation  analysis. 

4.  Discussion 

The  large  extent  of  ICN  reproduction,  and  their  hier¬ 
archical  organization  into  functional  groups  using  an  en¬ 
tirely  different  approach  than  that  described  in  Laird  et  al. 
(2011),  provides  strong  evidence  for  the  analytical  poten¬ 
tial  of  NFM.  Furthermore,  the  technique  reveals  nonlin¬ 
ear  interactions  that  are  not  discoverable  with  standard 
linear  techniques,  or  without  prior  hypotheses.  Such  rela¬ 
tionships  could  provide  a  new  window  into  brain  function, 
and  this  highlights  the  potential  of  the  methodology  as 
a  hypothesis  generator.  Of  course  proper  care  must  be 
taken  (with  regard  to  independence  of  observations,  etc.) 
in the ensuing investigations of such data-driven hypotheses.
Nonetheless, hypothesis generation is a powerful tool
for scientific exploration, and has been used recently to
inform biomedical research, such as in Abedi et al. (2012)
and Spangler et al. (2014).

Figure 8: Hierarchies for groups with high (top) and low (bottom) alcohol consumption rates, defined by two or more lifetime drinks and one
or fewer lifetime drinks, respectively.

Figure 10: Interaction between the right angular gyrus from the
default mode network (red), and posterior cingulate cortex (green)
is 74% higher among drinkers.

In addition to providing insight on its own, the NFM
procedure complements other modes of analysis. A potentially
promising extension, especially for a hybrid version
capable of voxel level analysis (discussed in Appendix
D), would be to use it in conjunction with graph-theoretic
analyses such as those described in Bassett and Bullmore
(2006), Stam and Reijneveld (2007), and van den Heuvel
et al. (2008). The general technique, as detailed in Bullmore
and Sporns (2009) and Rubinov and Sporns (2010),
is to compute pairwise correlations among all voxels, set
a threshold above which two voxels are considered connected,
and calculate various network summary measures
(e.g., degree distribution, assortativity, diameter, etc.). By
simply replacing correlations in these networks with interaction
rates determined by NFM, the assumption of linearity
is left behind.

Finally, it is important to note that the specific forms
of the models in the output of the GP algorithm have been
analyzed simplistically in the present work. A major potential
benefit of NFM is the insight that might be gained
from precisely analyzing these mathematical descriptions
of the relationships among ROI in the brain. Of course,
ascribing meaning to any particular one of these models
would have to be done cautiously. However, given the results
we describe here, obtained by a coarse treatment, the
collection of models determined by GP may offer a number
of as yet undiscovered insights. This seems a potentially
fruitful avenue for future theoretical research.


5.  Conclusions 

Results  produced  in  our  study  suggest  that  there  is  po¬ 
tential  analytical  power  in  the  use  of  NFM,  or  some  mod¬ 
ification  thereof,  in  the  neuroimaging  domain.  The  proce¬ 
dure  we  investigated  here  utilizes  commercially  available, 
out-of-the-box  GP  software,  and  preliminary  statistical 
analysis  of  its  output.  Many  improvements  and  extensions 
are  possible,  only  some  of  which  we  have  suggested  in  this 
work.  Reproduction  of  recent  results  constitutes  a  measure 
of  cross-validation,  and  the  preliminary  results  presented 
demonstrate the unique capability of NFM to discover nonlinear
relationships among regions of the brain that hold
promise for illuminating differences in brain function between
subject groups. Further, the mathematical characterizations
we have achieved, which are not limited by
linear or univariate assumptions, are ripe for future investigation.

Appendix A. Table of ROI

In Table A.1 we list all 52 ROI investigated in this
study  by  number,  give  their  anatomical  names,  indicate 
the  ICN  from  within  which  they  were  defined,  and  provide 
visual  representations  of  their  locations  within  the  brain. 
Due  to  its  length,  the  table  appears  after  the  References. 

Appendix  B.  Subject-level  variation 

Figure B.11 contains individual subject interaction hierarchies
for two different (randomly selected) subjects,
generated  by  100  random  restarts  of  the  GP  algorithm 
and  subsequent  normalized  frequency  analysis.  Though 
the  two  hierarchies  are  quite  different  from  one  another, 
they  do  show  some  network  organization  similar  to  that 
illustrated  in  the  population  level  hierarchy  (top  of  Figure 
5). 

•  Portions  of  the  visual  cluster  (ROI  32-38)  are  intact 
in each case.

• Many of the two-region networks remain together,
e.g., ICN 1 (ROI 1,2), ICN 16 (ROI 48,49), and
though associated with other ROI, also ICN 3 (ROI
5,6) and ICN 17 (ROI 50,51).

•  The  default  mode  network  (ROI  39-42)  is  mostly  in¬ 
tact  in  each  subject. 

The interaction profile of the default mode network
illustrates an interesting distinction between the two subjects.
For the top subject, the network is fully intact,
interacting with ROI 4 (consistent with the population level
hierarchy), and also interacting with ROI 26 from the motor
visuospatial complex. For the bottom subject, three of
the four ROI in the network remain together, but interact
with several ROI from the emotional interoceptive
class (specifically ROI 10, 12, and 16) instead of ROI 4,
and with a different ROI from the motor visuospatial
complex as well (ROI 29 instead of 26). By themselves, these
dendrogram comparisons offer no conclusive evidence
regarding connections between cognitive processes. However,
an experiment could be designed to test whether any
inferences can be made from such distinctions. For example,
the administration of post-scan surveys might grant some
interpretability to the specifics of these single-subject
interaction hierarchies.

Appendix  C.  Linear  HCA  of  alcohol  consumption 

Here  we  demonstrate  that  the  shuffling  of  the  interac¬ 
tion  hierarchy  in  drinking  (D)  versus  non-drinking  (ND) 
adolescents  discovered  by  NFM  is  not  uncovered  by  linear 
correlation  analysis.  To  perform  group-level  correlation 
analysis, the normalized relative $R^2$ matrices for each subject
(described in Section 3.2) are averaged over the 100
subjects in each group. Recall that these matrices are generated
for each subject by computing the correlation matrix
for the 52 ROI time series, squaring the elements, and
normalizing each row (after setting the diagonal to zero).
The reciprocal of relative explained variance can be considered
a distance between ROI (higher relative $R^2$ means
closer), and the resulting D and ND hierarchies generated
by HCA are shown in the top and bottom, respectively, of
Figure C.12.
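
The construction of the linear distance matrix described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code used in the study: the function names are hypothetical, and "normalizing each row" is assumed here to mean dividing each row by its sum. The resulting distances could then be passed to any standard HCA routine.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length time series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def relative_r2(series):
    """Squared correlations with a zeroed diagonal, each row scaled to sum to 1.

    The row scaling is an assumption about what "normalizing" means here.
    """
    k = len(series)
    matrix = []
    for i in range(k):
        row = [0.0 if i == j else pearson(series[i], series[j]) ** 2
               for j in range(k)]
        total = sum(row)
        matrix.append([value / total for value in row])
    return matrix

def group_average(matrices):
    """Element-wise mean of the per-subject matrices in one group."""
    k = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) / len(matrices) for j in range(k)]
            for i in range(k)]

def reciprocal_distance(rel):
    """Distance between ROI as the reciprocal of relative explained variance."""
    k = len(rel)
    return [[0.0 if i == j else (1.0 / rel[i][j] if rel[i][j] > 0 else math.inf)
             for j in range(k)]
            for i in range(k)]
```

Note that row normalization makes the matrix asymmetric, so a symmetrization step (for example, averaging each entry with its transpose) may be needed before clustering; the text does not specify how this was handled.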

Comparison  with  Figure  8  suggests  that  this  linear 
analysis  partially  uncovers  a  distinguishing  difference  in 
interaction  between  drinking  and  non-drinking  adolescents. 
Specifically,  among  non-drinkers,  a  higher  intra-network 
interaction  within  ICN  2,  comprised  of  the  subgenual  ACC 
and  the  vmPFC  (ROI  3  and  4,  respectively)  is  detected 
here.  The  result  is  an  indirect  coupling,  within  non-drinkers, 
of  the  default  mode  network  (ROI  39-42)  and  the  complex 
comprised  of  ROI  3,18,5,6,  through  the  vmPFC. 

The results of NFM provide further insight in two important
ways. First, the elevated interaction within ICN
2 among non-drinkers is detected at twice the strength.
Second, the main interaction responsible for grouping the
complex of ROI 3,18,5,6, specifically the interaction between
the left globus pallidus (ROI 6) and fornix body
(ROI 18), is lower among non-drinkers. This second effect
is entirely missed by correlation analysis, suggesting that
it is nonlinear in nature. The result of capturing these
effects together, as shown in Figure 8, is a breakup of
the complex in non-drinkers, for whom ROI 3,18 are separated
from the bilateral globus pallidus of ICN 3 (ROI
5-6). This breakup is suggestive, as each of these ROI is
associated with emotion, reward, and interoceptive processes
such as thirst, and experiments reporting activity
in ICN 5, including the fornix body (ROI 18), predominantly
involved interoceptive stimulation, as reported in
Laird et al. (2011).

Appendix  D.  Improvements  and  modifications 

The  GP  implementation  we  used  for  this  study  is  the 
commercially  available  package  Eureqa  from  Nutonian,  as 
described  in  Schmidt  and  Lipson  (2009).  Though  much 
of  its  behavior  can  be  controlled  through  the  interface  or 
command  line,  it  is  proprietary  code  and  thus  somewhat 
of  a  black  box.  There  are  many  reasons  why  a  dedicated, 
open  source  implementation  of  GP  would  be  more  desir¬ 
able. 

A  major  challenge  for  this  method  of  analysis  is  the 
computational  expense  of  running  a  large  number  of  GP 
searches.  Generating  the  IR  map  for  a  single  subject  re¬ 
quires  a  large  number  of  random  restarts  for  each  ROI.  For 
example,  running  100  restarts  for  each  of  the  52  ROI  in 
this  study,  allowing  1  core-hour  for  each  search,  requires 
over  10  hours  with  access  to  500  dedicated  processors.  The 
procedure  as  described  here  is  likely  computationally  pro¬ 
hibitive  for  running  analyses  on  large  numbers  of  subjects, 
or  for  larger  collections  of  ROI.  Intelligent  stopping  crite¬ 
ria,  and  many  other  approaches  to  the  mitigation  of  com¬ 
putational  expense,  have  been  reported  at  length  in  the 
GP  literature,  an  example  of  which  is  the  use  of  graph¬ 
ics  processors  reported  in  Harding  and  Banzhaf  (2007).  It 
may  also  be  possible  to  determine  an  ideal  (and  smaller) 
number  of  restarts  that  balances  computation  time  with 
the  statistical  power  of  the  resulting  IR  map. 
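
The estimate above follows from simple arithmetic, reproduced here as a short sketch. The function name is hypothetical, and the calculation assumes the searches are independent and parallelize perfectly across processors.

```python
def wall_clock_hours(n_roi, restarts_per_roi, core_hours_per_search, n_processors):
    """Idealized wall-clock time for one subject's IR map.

    Assumes every search is independent and the processors are kept fully busy.
    """
    total_core_hours = n_roi * restarts_per_roi * core_hours_per_search
    return total_core_hours / n_processors

# Figures quoted in the text: 52 ROI, 100 restarts each, 1 core-hour per search.
hours = wall_clock_hours(n_roi=52, restarts_per_roi=100,
                         core_hours_per_search=1.0, n_processors=500)
```

With these figures the total is 5,200 core-hours, or 10.4 hours on 500 processors, per subject.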

It  should  also  be  noted  that  for  collections  of  ROI  much 
larger  than  that  considered  here,  in  addition  to  the  com¬ 
putational  expense  resulting  from  more  required  searches, 
each  search  will  take  much  longer  to  produce  meaningful 
models  due  to  the  larger  number  of  possible  explanatory 
variables.  A  hybrid  method  of  symbolic  regression  employ¬ 
ing  a  machine  learning  algorithm  called  FFX  (Fast  Func¬ 
tion  Extraction)  described  in  McConaghy  (2011)  as  a  first 
pass,  and  then  GP,  has  great  potential  for  the  treatment 
of  higher  dimensional  data,  e.g.,  large  numbers  of  ROI. 
A  prototype  of  this  method  was  reported  in  Icke  et  al. 
(2014). FFX is a deterministic algorithm that builds up
models with nonlinear terms (e.g., products of ROI signal)
in a prescribed fashion and evaluates explanatory power at
each stage. By ruling out ROI that are likely not explanatory
at each stage, the algorithm reduces the dimensionality
of the search. In other words, at the cost of reduced
breadth in the search space, the algorithm provides huge
reductions in computation time in addition to reducing the
number of variables that will eventually be injected into
the GP algorithm. Implemented effectively, this hybrid
algorithm could eliminate the necessity of ROI selection
completely by allowing direct regression over voxel signals.

Figure B.11: Example hierarchies for two different individual subjects. Note the large degree of variation between the two.

Figure C.12: Linear hierarchies for groups with high (top) and low (bottom) alcohol consumption rates, defined by two or more lifetime drinks
and one or fewer lifetime drinks, respectively.

An  ever-present  concern  in  the  analysis  of  fMRI  is  the 
level  of  noise  in  the  data.  Particularly  in  the  case  of  re¬ 
gressing  over  voxel  signals,  low  signal-to-noise  ratio  is  a 
major  challenge,  and  indeed  GP  efficacy  is  diminished  in 
such  circumstances.  However,  there  has  been  some  work 
on  modifying  the  GP  algorithm  to  better  manage  noisy 
data,  an  example  of  which  is  the  inclusion  of  noise  genera¬ 
tors  called  stochastic  elements  with  user-defined  distribu¬ 
tions  (e.g.,  Gaussian  or  uniform)  as  potential  explanatory 
“variables”.  These  generators  can  themselves  end  up  in¬ 
side  complex  functions  within  the  models,  providing  those 
models  the  capability  of  reproducing  realistic  noise  distri¬ 
butions  more  likely  to  be  at  play  than  the  typical  Gaussian. 
There  is  no  guarantee  that  this  modification  will  prove 
beneficial  in  the  case  of  fMRI,  but  it  has  been  shown,  in 
Schmidt  and  Lipson  (2007),  to  effectively  identify  exact 
underlying  analytical  models  in  the  presence  of  nonlinear, 
non-Gaussian  and  nonuniform  noise. 
Table A.1: Table of ROI

[Visualization column: images of ROI locations not reproduced.]

ROI  ICN  ROI Description
  1    1  left anterior hippocampus
  2    1  right anterior hippocampus
  3    2  subgenual anterior cingulate cortex, anterior caudate
  4    2  ventromedial prefrontal cortex, medial frontal gyrus
  5    3  right globus pallidus
  6    3  left globus pallidus
  7    4  right anterior insula
  8       left anterior insula
  9       middle cingulate cortex, dorsomedial prefrontal cortex
 10    5  inferior cerebellum
 11    5  posterior cingulate cortex
 12    5  inferior vermis
 13    5  inferior vermis
 14    5  anterolateral cerebellum
 15    5  anterolateral cerebellum
 16    5  posterior cerebellum
 17    5  inferior colliculus, anterior vermis
 18    5  fornix (body)
 19    6  posterior dorsomedial prefrontal cortex
 20    6  left superior precentral gyrus
 21    6  right posterior superior parietal cortex
 22    7  left superior parietal cortex
 23    7  right precuneus
 24    7  left superior parietal cortex
 25    7  right superior parietal cortex
 26    7  left posterior dorsolateral prefrontal cortex
 27    7  right posterior dorsolateral prefrontal cortex
 28    8  left postcentral gyrus
 29    8  paracentral lobule
 30    8  anterior inferior parietal cortex
 31    8  right postcentral gyrus
 32   10  right lateral occipital cortex
 33   10  left lateral occipital cortex
 34   11  left occipital pole
 35   11  right occipital pole
 36   12  lingual gyrus
 37   12  right cuneus
 38   12  right fusiform gyrus
 39   13  posterior cingulate cortex
 40   13  left angular gyrus
 41   13  right angular gyrus
 42   13  anterior dorsomedial prefrontal cortex
 43   14  right superior cerebellum
 44   14  vermis
 45   14  left superior cerebellum
 46   15  right middle frontal gyrus
 47   15  right supramarginal gyrus
 48   16  left superior temporal gyrus
 49   16  right superior temporal gyrus
 50   17  right inferior pre and post central gyrus
 51   17  left inferior pre and post central gyrus
 52   18  left inferior frontal gyrus

Approved for public release; distribution is unlimited.
4  Symbolically regressing satellite imagery.

Large-scale imaging of the Earth's surface presents many opportunities for
advances in understanding. In a series of manuscripts we have shown that symbolic regression is
particularly useful in this domain.

In the first manuscript (Sect. 4.2) we demonstrated that symbolic regression could be adapted such that it draws
intelligently on sensors that are more or less costly to query. The resulting model can successfully adapt
its prediction as the cost and/or availability of different sensing systems fluctuate over time.

In the second manuscript we demonstrated that symbolic regression is particularly useful for predicting an
environmental variable of interest (in this case, the amount of water contained in snow pack) across
a wide region.

Finally, in [6] we have demonstrated that symbolic regression can act as a form of compressed
sensing: it can be successfully trained even if the data set is underdetermined (more features than
observations). This works by enabling symbolic regression not just to optimize the structure of
the model, but also to optimize the sub-regions of the Earth's surface within which observations
are averaged (compressed). This produces particularly
intuitive models, because a lay observer can gain intuition into the model by observing which regions
it draws averages from for prediction.

4.1  Relevance  for  U.S.  defense  and  security. 

It would be useful for military personnel or stakeholders not only to obtain predictions about
environmental variables that change over time (snow, vegetation, rainfall) but also to easily understand
how a model predicts those variables may change in the near future. The methods presented in
the three manuscripts that follow provide one way to do so.

4.2  Yousefi  et  al.  “A  Genetic  Programming  Approach...”  (2015). 

A  technical  manuscript  describing  how  symbolic  regression  can  balance  prediction  and  cost  effec¬ 
tiveness  follows. 
ABSTRACT

Resource  constrained  sensor  systems  are  an  increasingly  attractive 
option  in  a  variety  of  environmental  monitoring  domains,  due  to 
continued  improvements  in  sensor  technology.  However,  sensors 
for  the  same  measurement  application  can  differ  in  terms  of  cost 
and  accuracy,  while  fluctuations  in  environmental  conditions  can 
impact  both  application  requirements  and  available  energy.  This 
raises  the  problem  of  automatically  controlling  heterogeneous  sen¬ 
sor  suites  in  resource  constrained  sensor  system  applications,  in  a 
manner  that  balances  cost  and  accuracy  of  available  sensors.  We 
present  a  method  that  employs  a  hierarchy  of  model  ensembles 
trained  by  genetic  programming  (GP):  if  model  ensembles  that  poll 
low-cost  sensors  exhibit  too  much  prediction  uncertainty,  they  au¬ 
tomatically  transfer  the  burden  of  prediction  to  other  GP-trained 
model  ensembles  that  poll  more  expensive  and  accurate  sensors. 
We  show  that,  for  increasingly  challenging  datasets,  this  hierarchi¬ 
cal  approach  makes  predictions  with  equivalent  accuracy  yet  lower 
cost  than  a  similar  yet  non-hierarchical  method  in  which  a  sin¬ 
gle  GP-generated  model  determines  which  sensors  to  poll  at  any 
given  time.  Our  results  thus  show  that  a  hierarchy  of  GP-trained 
ensembles  can  serve  as  a  control  algorithm  for  heterogeneous  sen¬ 
sor  suites  in  resource  constrained  sensor  system  applications  that 
balances  cost  and  accuracy. 

Categories  and  Subject  Descriptors 

I.2 [Computing Methodologies]: Artificial Intelligence

Keywords 

Genetic  Programming,  Resource  Constrained  Sensor  Systems,  Cost- 
Sensitive  Control,  Sensor  Fusion 

1.  INTRODUCTION 

Resource  constrained  sensor  systems  (RCSS)  such  as  Wireless 
Sensor  Networks  have  revolutionized  environmental  monitoring  by 
combining  low  cost  with  flexibility  in  sensor  capabilities  [29].  They 
have  been  used  in  diverse  environmental  monitoring  applications 
and continue to be adapted in new fields. Because RCSS are often,
even typically, deployed in remote locations, and thus rely on
combinations of battery power and energy harvesting, a major challenge
in RCSS design is to minimize system power consumption.

Minimizing power consumption can be accomplished in a variety
of ways, in particular by adapting sensor control strategies that
optimize the balance between measurement accuracy and the cost
of powering sensors [28]. In this paper, we propose new sensor
control algorithms for RCSS with heterogeneous sensor suites that
balance cost and accuracy, obtained using genetic programming (GP)
techniques.

By  “heterogeneous  sensor  suite”,  we  mean  RCSS  equipped  with 
multiple  types  of  sensors  for  prediction  of  the  same  phenomena. 
Each  of  these  sensors  is  characterized  by  its  accuracy  in  relation 
to  the  phenomena,  and  a  cost  of  use  which  is  often  measured  by 
its  power  consumption.  Such  systems  support  multi-modal  sensor 
fusion,  a  well-studied  technique  where  data  from  multiple  sensor 
modalities  (types)  is  combined  to  predict  a  single  variable  [28].  The 
contribution  of  our  work  is  a  consideration  of  cost  in  multi-modal 
sensor  fusion,  and  the  development  and  testing  of  associated  con¬ 
trol  algorithms.  These  algorithms  will  call  upon  particular  sensors 
only  when  needed,  and  otherwise  rely  on  the  cheapest  available 
sensors at any given time. Our problem is distinguished from
adaptive sampling [28] in that the latter is concerned with optimally
modulating sampling frequency of a given sensor, not choosing
from a suite of possible sensors.

While various multi-modal sensor fusion applications exist, we
are especially interested in the Snowcloud system which combines
snow density telemetry with snow depth and air temperature
sensors to predict areal snow water equivalent (SWE) [22]. We
envision extending Snowcloud to incorporate ground based light
detection and ranging (LIDAR) scanning [4] to be used for SWE
estimation as part of its sensor suite. However, while LIDAR yields
more accurate data than existing Snowcloud telemetry, it does so
at significant additional power cost. Thus, the challenge is to
commit these resources only at optimal times. This problem is also a refinement
of multi-modal sensor fusion, since we are mainly interested in
settings where available data gathering techniques differ in accuracy,
with less accurate sensors being cheaper than more accurate ones.

A  fundamental  component  of  our  approach  is  the  use  of  pre¬ 
diction  uncertainty  to  drive  sensor  usage.  We  propose  a  scheme 
whereby  predictions  are  attempted  using  lower-cost  sensors  at  first. 
If  uncertainty  is  below  an  acceptable  threshold,  then  the  predic¬ 
tion  is  used.  Otherwise  we  switch  to  higher-cost  sensors,  make  a 
new  prediction  based  on  those  inputs,  evaluate  uncertainty  again, 
and  continue  to  move  the  burden  of  prediction  to  more  accurate 
and  costly  sensors  as  needed.  This  scheme  is  discussed  in  detail  in 
Section 3.4 and described graphically in Figure 2. Note that while
the Snowcloud system is an intended application of this scheme, it
can be generalized to any RCSS application using heterogeneous
sensor suites comprising sensors with varying cost and accuracy.
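
The escalation scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation detailed in Section 3.4: the function and parameter names are hypothetical, the ensemble prediction is taken to be the mean of its members, the standard deviation of ensemble outputs stands in for the uncertainty measure, and the most expensive tier's prediction is returned if no tier meets the threshold.

```python
from statistics import mean, pstdev

def predict_with_escalation(tiers, read_sensor, threshold, uncertainty=pstdev):
    """Predict with the cheapest sensors first, escalating while uncertain.

    tiers: list of (sensor_names, ensemble) pairs ordered cheapest first, where
           ensemble is a list of callables mapping a dict of readings to a value.
    read_sensor: callable that polls one sensor by name (hypothetical interface).
    Returns the prediction and the sorted names of all sensors that were polled.
    """
    readings = {}
    prediction = None
    for sensor_names, ensemble in tiers:
        for name in sensor_names:
            if name not in readings:          # poll each sensor at most once
                readings[name] = read_sensor(name)
        outputs = [model(readings) for model in ensemble]
        prediction = mean(outputs)
        if uncertainty(outputs) <= threshold:
            break                             # confident enough: stop escalating
    return prediction, sorted(readings)
```

Passing the uncertainty measure as a parameter keeps the control flow independent of how disagreement within an ensemble is quantified.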

To quantify uncertainty we are aided by machine learning
ensemble methods: we use entropy in ensemble predictions as a proxy
for uncertainty [21]. To obtain predictive models themselves, in
this  work  we  use  genetic  programming  (GP).  This  is  largely  due 
to  characteristics  of  our  intended  application  space.  Previous  work 
has  demonstrated  that  the  relationships  between  snow  cover  and 
the  topographic  and  meteorological  factors  that  influence  it  include 
nonlinearities  [24],  while  the  spatial  distribution  of  SWE  is  non¬ 
linear  because  it  is  influenced  simultaneously  by  various  forcing 
effects  [25].  Nonlinear  predictors  are  therefore  desirable.  Further¬ 
more,  recent  results  [6]  show  that  GP  has  advantages  over  other  ap¬ 
proaches  (such  as  C4.5)  due  to  associated  techniques  for  preventing 
overfitting,  e.g.  treating  model  size  minimization  as  an  objective 
[11].  Although  C4.5  only  supports  classification,  sufficiently  fine 
classification  granularity  can  achieve  competitive  performance  on 
regression problems, and this approach is popular in the
environmental science community [6]. Finally, GP is appealing due to its
white-box nature: it can potentially provide physical insights into
modeled phenomena.
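
One plausible way to turn ensemble disagreement into an entropy value is sketched below. The exact formulation used in this work is not given in this section, so the binning step and the `bin_width` parameter are assumptions made purely for illustration, as is the function name.

```python
import math
from collections import Counter

def ensemble_entropy(predictions, bin_width):
    """Shannon entropy (in bits) of ensemble predictions after binning.

    Each prediction is assigned to a bin of the given width; the entropy of the
    resulting bin frequencies is low when members agree and high when they
    scatter across many bins.
    """
    bins = Counter(math.floor(p / bin_width) for p in predictions)
    n = len(predictions)
    return -sum((count / n) * math.log2(count / n) for count in bins.values())
```

An ensemble whose members all fall in one bin has zero entropy, while one whose members each land in a distinct bin reaches the maximum of log2 of the ensemble size.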

An  alternative  approach  to  our  problem  is  to  not  rely  on  external 
measures  of  entropy  to  switch  between  sensors,  but  to  treat  cost  as 
an  additional  objective  in  a  multi-objective  optimization  problem. 
We  explore  this  option  in  our  work,  in  direct  comparison  to  the  hier¬ 
archical  approach.  However,  due  to  the  “curse  of  dimensionality”, 
adding  another  optimization  dimension  may  have  deleterious  ef¬ 
fects  on  prediction  performance,  especially  since  selection  for  size 
to  avoid  overfitting  already  imposes  a  multi-objective  optimization 
regime [5]. We therefore hypothesize that a hierarchical approach
will outperform a non-hierarchical approach in settings where
multiple sensors with differing predictive abilities are available, and we explore this
comparison in our experiments.

2.  RELATED  WORK 

Previous  work  on  adaptive  sampling  [28]  has  aimed  to  reduce 
sampling  rates  in  RCSS  applications  to  balance  sensor  cost  and  ac¬ 
curacy.  In  particular,  Alippi  et  al.  [3]  have  tried  to  find  the  optimal 
adaptive  frequency  of  sampling  for  avalanche  monitoring.  It  has 
further  been  claimed  that  compressed  sensing  —  sending  aggre¬ 
gated  data  instead  of  raw  data  —  performs  better  in  conjunction 
with  reducing  sampling  rates,  rather  than  just  reducing  the  sam¬ 
pling  rate  alone  [15].  A  variety  of  methods  for  compressed  sensing 
[8]  have  been  proposed.  Although  these  methods  have  achieved 
cost  reduction  in  monitoring,  they  are  not  applicable  to  our  problem 
since  we  intend  not  to  change  the  rate  of  sampling  one  sensor  type, 
but  rather  to  reduce  sampling  cost  by  switching  between  available 
sensors  of  different  type  and  accuracy. 

Another  line  of  work  focuses  on  finding  the  optimal  location  for 
sensors  in  distributed  deployments,  in  order  to  maximize  accuracy 
while  minimizing  deployment  densities.  Krause  et  al.  [13]  have 
used  a  probabilistic  method  to  predict  the  communication  cost  for 
a  given  deployment  topology.  Papadimitriou  et  al.  [17]  have  em¬ 
ployed  GP  and  a  Bayesian  statistical  method  to  minimize  entropy 
over  a  set  of  sensor  locations.  In  contrast,  our  work  is  concerned 
with  reducing  the  cost  of  sampling  from  an  available  set  of  sensors 
at  any  given  time,  not  with  reducing  the  densities  of  sensor  topolo¬ 
gies. 

In work on so-called multi-modal sensor fusion, data from
multiple sensors in a potentially heterogeneous suite are aggregated to
monitor a specific measurement application [26, 9]. This method
has been widely used, for example in visual monitoring [16, 18]
and target tracking [19, 23]. Data fusion focuses on sensor
applications that need to compute the correlation between multiple
sensor modules, and whose target cannot be measured by a single sensor. However,
these works do not consider the cost of using different sensors, or
minimizing cost.

Cost sensitive multi-modal sensor fusion methods have been
developed to balance cost against accuracy, with an eye towards
providing fault tolerance [12]. However, we are not concerned with
fault tolerance, but strictly with selecting sensors from
heterogeneous suites. Willett et al. [28] use a small number of sensors to
send their readings to a fusion center, and based on the correlation
among the sensed data, the fusion center decides which additional
sensors should be activated. The same concept has also been tried
in a distributed fashion [14]. However, sensing costs in these cases
are a function of the number of sensors sampled, not their type.

Perhaps most related to our work is that of Wang et al. [27].
They  propose  a  method  to  find  the  optimal  set  of  sensors  to  be 
polled,  using  a  hybrid  tree,  where  non-leaf  nodes  act  as  a  deci¬ 
sion  tree  and  leaves  are  standard  regression  models  using  a  subset 
of  sensors.  However,  these  trees  support  decision  making  based  on 
external  constraints,  i.e.  which  sensors  to  use  depending  on  an  orga¬ 
nization’s  goals  and  resources.  In  contrast,  our  models  are  intended 
to  support  sensor  control  in  RCSS  during  deployments. 

Outside  of  the  adaptive  sampling  and  sensor  fusion  fields,  multi¬ 
objective  optimization  has  been  used  for  cost-sensitive  modeling. 
For  example  Kim  [11]  set  error  as  one  objective  and  tree  size  as 
another,  as  we  do  here.  Zhao  [30]  sets  the  false  negative  rate  and 
false  positive  rate  as  the  two  objectives.  However,  these  works  do 
not  consider  the  hierarchical  approach  that  we  do. 


3.  METHODS 

This section provides a formalization of the problem, how
genetic programming is applied to solve it, and the two variants of
genetic programming that we compare in this work. All of the
material for replicating the work described here is available online [1].

3.1  Problem  Formalization 

Let us assume that $t$ values of some environmental phenomenon
$g$ (the ground truth) are known at time steps $1, \ldots, t$. These values
are stored in $\mathbf{g} = g^{(1)}, \ldots, g^{(t)}$. Let us further assume there are $k$
sensors $s_1, \ldots, s_k$ available that can be used to predict $g$. Let $r_i^{(t)}$
denote the reading of sensor $i$ taken at time $t$. Moreover, let $s^{(t)}$
and $r^{(t)}$ denote a subset of sensors, and readings taken from them,
at time $t$. We denote the amount of variance of $g$ explained by
sensor $i$ as $v_i^{(g)}$. This value is determined by linearly regressing
only $r_i$ against $g$. Finally, let $e_i = 100(1 - v_i^{(g)})$ and $c_i$ represent
the prediction error and cost of using sensor $i$ respectively. Using
this formulation, $e_i$ represents the percentage of prediction error
incurred by just using sensor $i$ to predict $g$.

The cost of a sensor $c_i$ is usually inversely proportional to its
error $e_i$, so for the work reported below, we set $c_i$ inversely proportional to $e_i$ for each
sensor. In certain sensor deployments there may be other factors
that affect $c_i$ such as power consumption, market price, effort
required to collect a sensor's reading, proprietary issues, and so on.

We suppose that an ordering of sensors exists such that $s_1$ is the
least expensive sensor with the highest error and $s_k$ is the most
expensive sensor with the lowest error. Formally,

$$\forall i, j .\; 1 \le i < j \le k \rightarrow e_i > e_j \wedge c_i < c_j.$$

Let us denote the prediction of a model using a subset of sensors
at time $t$ by $p^{(t)}$, i.e. $p^{(t)}$ is a function on $r^{(t)}$. Then, the error of
each sampling $e^{(t)}$ would be

$$e^{(t)} \triangleq |p^{(t)} - g^{(t)}|.$$

The cost of each sampling, $c^{(t)}$, is the cumulated cost of all
sensors $s_i \in s^{(t)}$ that were polled for that sampling:

$$c^{(t)} \triangleq \sum_{i \in \{i \mid s_i \in s^{(t)}\}} c_i.$$

It is desired that each sampling entails low error and cost.
That is, the following equality is desirable:

$$\operatorname*{argmin}_{s^{(t)}} e^{(t)} = \operatorname*{argmin}_{s^{(t)}} c^{(t)}.$$

Our  goal  is  to  design  models  which  combine  and  transform  sen¬ 
sor  readings  to  accurately  predict  the  outcome  measure,  but  can 
also  intelligently  determine  which  sensors  to  poll  when  cheap,  less 
accurate  sensors  exhibit  uncertainty  about  the  current  prediction. 
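
The quantities defined in this formalization translate directly into code. The sketch below is illustrative only; the function names are hypothetical and sensors are identified by zero-based list index rather than by the subscripts used in the text.

```python
def sensor_error_percent(explained_variance):
    """Per-sensor error: percent of the variance in g left unexplained by one sensor."""
    return 100.0 * (1.0 - explained_variance)

def sampling_error(prediction, ground_truth):
    """Absolute error of one sampling."""
    return abs(prediction - ground_truth)

def sampling_cost(polled, costs):
    """Summed cost of the distinct sensors polled for one sampling."""
    return sum(costs[i] for i in set(polled))

def is_ordered(errors, costs):
    """Check the assumed ordering: error strictly falls and cost strictly rises.

    Comparing adjacent sensors is sufficient because both relations are transitive.
    """
    return all(errors[i] > errors[i + 1] and costs[i] < costs[i + 1]
               for i in range(len(errors) - 1))
```

For instance, three sensors explaining 20%, 50%, and 90% of the variance have errors of roughly 80, 50, and 10 percent, and satisfy the ordering whenever their costs increase in the same order.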

3.2  General  Genetic  Programming  approach 

Genetic  programming  has  widely  been  employed  for  regression 
tasks  in  which  the  functional  form  of  the  equations  relating  inputs 
to  outputs  is  unknown.  Here,  inputs  are  sensor  values  and  output  is 
a  prediction  for  a  given  outcome  measurement. 

Although  many  recent  improvements  have  been  proposed  for  GP, 
here  we  have  kept  the  genetic  programming  algorithm  simple  and 
instead  focused  on  comparing  GP-generated  hierarchical  and  non- 
hierarchical  models.  Thus,  GP  is  restricted  to  the  four  simple  al¬ 
gebraic  operators,  and  each  evolutionary  trial  is  initialized  with  a 
fixed-sized  population  of  randomly-generated  solutions  containing 
three  nodes.  Maximum  tree  depth  is  not  set  since  the  tree  size  is 
considered  as  an  objective  in  multi-objective  optimization.  The 
crossover  rate  is  set  to  0.2  and  no  fitness  stall  is  considered.  If  the 
number  of  non-dominated  solutions  reaches  50%  of  the  population 
size,  the  training  restarts.  At  the  conclusion  of  each  generation, 
four  values  are  computed  for  each  solution:  (1)  error  on  training 
data  as  defined  below,  (2)  the  combined  cost  of  the  sensors  used  to 
make  the  prediction,  (3)  the  size  of  the  solution,  and  (4)  the  age  of 
the  solution.  We  now  discuss  each  in  turn. 
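
The non-dominated set and the restart condition mentioned above can be sketched as follows. This is a generic Pareto-dominance check, assuming every objective is minimized and each solution is summarized by a tuple of objective values; the function names and the tuple ordering are hypothetical, and the full set of four objectives is used only for illustration.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(objectives):
    """Indices of the objective vectors that no other vector dominates."""
    return [i for i, a in enumerate(objectives)
            if not any(dominates(b, a)
                       for j, b in enumerate(objectives) if j != i)]

def should_restart(objectives, fraction=0.5):
    """Restart once the non-dominated solutions reach the given population share."""
    return len(pareto_front(objectives)) >= fraction * len(objectives)

# Each tuple is (error, cost, size, age) for one solution.
population = [(1.0, 1.0, 3, 0), (2.0, 2.0, 3, 1), (0.5, 3.0, 5, 2), (3.0, 3.0, 7, 3)]
front = pareto_front(population)
```

In this example the first and third solutions are non-dominated, which is half of the population and so meets the 50% restart condition.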

Error: Let $n$ be the population size and $j$ range over $\{1, \ldots, n\}$.
Let $t_j$ be some solution tree. We represent the error of sampling at
time $t$ using solution $t_j$ with $e_{t_j}^{(t)}$. Moreover, $d^{train}$ and $d^{test}$ denote
the training dataset and testing dataset, respectively. Then, we
define the error on training data using solution $t_j$ by $e_{t_j}^{train}$, as the
average of $e_{t_j}^{(t)}$ on all samples in $d^{train}$, i.e.,

$$e_{t_j}^{train} \triangleq \sum_{g^{(t)} \in d^{train}} \frac{e_{t_j}^{(t)}}{|d^{train}|}.$$

Each solution $t_j$ was allowed to use a subset (possibly empty) of
available sensors. The cost of each solution depends on the sensors
that are employed and the sampling.

Cost: As described in the following sub-sections, the current
readings of the sensors may trigger readings from additional
sensors. Thus, different $r^{(t)}$ may cause $t_j$ to need different $s^{(t)}$. The
average cost of a tree on training data $c_{t_j}^{train}$ is thus defined as the
cost of all of the sensors that have been used to predict the
outcome for each training instance, averaged over all instances in the
training data set:

$$c_{t_j}^{train} \triangleq \sum_{r^{(t)} \in d^{train}} \sum_{i \in \{i \mid s_i \in s^{(t)}\}} \frac{c_i}{|d^{train}|}.$$


If  a  solution  uses  a  sensor  more  than  once,  no  extra  cost  is  incurred: 
because  the  sensor  has  already  been  polled,  its  output  is  already 
available  and  can  thus  be  re-used  as  often  as  required. 
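
As a minimal sketch of this cost accounting (not the implementation used in the experiments), the following assumes each training sample is summarized by the list of sensors a solution tree referenced while evaluating it. The function name, sensor names, and cost values are hypothetical.

```python
def average_cost(sensors_used_per_sample, sensor_cost):
    """Mean polling cost over samples; each sensor is charged once per sample."""
    total = 0.0
    for used in sensors_used_per_sample:
        # set() removes repeated references to the same sensor within one sample,
        # so re-using an already polled sensor adds no cost.
        total += sum(sensor_cost[s] for s in set(used))
    return total / len(sensors_used_per_sample)


# Hypothetical sensor costs and per-sample sensor usage of one solution tree
sensor_cost = {"s1": 1.0, "s2": 5.0}
usage = [["s1"], ["s1", "s2", "s1"], []]
print(average_cost(usage, sensor_cost))
```

The second sample references s1 twice but is charged for it only once, which is the re-use rule stated above.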

Size:  To  avoid  bloat,  solution  size  was  incorporated  into  the  fit¬ 
ness  objectives  during  the  optimization  process  [7]. 

Age: We employed the Age-Fitness Pareto Optimization (AFPO)
method [20], which injects a new randomly-generated solution into
the population at each generation and compares solutions with the
same age in an effort to guard against convergence. Each solution's
age is defined as the number of generations since its oldest ancestor
was injected into the population. A new solution produced by
mutating an existing solution inherits the same age as its parent. If
two existing parents are crossed to produce two new offspring, the
offspring inherit the age of the older of the two parents. AFPO is
a multiobjective optimization method, as solution age is used as an
additional fitness objective during optimization.
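
The age bookkeeping described above can be sketched as follows. This is an illustrative outline only: the `Solution` class, the function names, and the choice of age 0 for a newly injected solution are assumptions, and the genome is left opaque.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    genome: object
    age: int = 0  # generations since the oldest ancestor was injected


def mutate(parent, new_genome):
    # A mutated child inherits its parent's age.
    return Solution(new_genome, parent.age)


def crossover(parent_a, parent_b, genome_1, genome_2):
    # Both offspring inherit the age of the older parent.
    age = max(parent_a.age, parent_b.age)
    return Solution(genome_1, age), Solution(genome_2, age)


def end_generation(population, random_genome):
    # Every solution grows one generation older, then a fresh solution
    # with age 0 is injected.
    for sol in population:
        sol.age += 1
    population.append(Solution(random_genome(), 0))
    return population
```

Because young solutions compete mainly against other young solutions on the age objective, a freshly injected random tree is not immediately eliminated by older, already optimized trees.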

Optimization.  At  the  end  of  each  generation,  the  Pareto  front 
is  computed  according  to  the  objectives  used,  and  the  dominated 
solutions  are  discarded.  Multi-objective  optimization  with  all  four 
objectives  described  above  could  easily  lead  to  population  collapse 
in  the  sense  that  all  members  of  the  population  could  become  non- 
dominated.  To  guard  against  this  eventuality,  one  possibility  is  to 
restart  the  evolutionary  run  with  new  solutions  if  no  dominated  so¬ 
lutions  are  detected  in  the  population  at  the  end  of  a  given  genera¬ 
tion.  Alternatively,  a  very  large  population  size  can  be  employed. 
However,  both  of  these  solutions  greatly  increase  the  computational 
effort required to obtain satisfactory solutions to the given problem.
To avoid this situation, different multi-objective optimization
approaches have been proposed. One of the simplest non-parametric
approaches  is  to  reduce  the  number  of  objectives  by  multiplying 
objectives  together  and  using  the  result  in  the  optimization  process 
[10].  In  this  experiment,  since  error  is  the  most  important  outcome, 
error  is  used  for  the  primary  objective  and  the  second  objective  is 
the  result  of  multiplying  cost,  size  and  age  together. 
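
A minimal sketch of this two-objective selection is given below, assuming each solution is a dictionary with `error`, `cost`, `size`, and `age` fields (hypothetical names) and that all objectives are minimized. The example ages start at 1 so that the product is non-zero; how an age of zero is handled in the product is not stated in the text.

```python
def objectives(sol):
    # Primary objective: error. Secondary: cost * size * age (both minimised).
    return (sol["error"], sol["cost"] * sol["size"] * sol["age"])


def dominates(a, b):
    # a dominates b if it is no worse on every objective and better on at least one.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(population):
    scored = [(objectives(s), s) for s in population]
    return [s for obj, s in scored
            if not any(dominates(other, obj) for other, _ in scored)]


population = [
    {"name": "A", "error": 0.10, "cost": 5.0, "size": 10, "age": 2},
    {"name": "B", "error": 0.20, "cost": 1.0, "size": 5, "age": 1},
    {"name": "C", "error": 0.30, "cost": 2.0, "size": 5, "age": 1},
]
# C is worse than B on both objectives, so only A and B survive.
print([s["name"] for s in pareto_front(population)])
```

Collapsing cost, size, and age into one product keeps the front two-dimensional, which makes it much less likely that every member of the population becomes non-dominated.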

Once  the  dominated  solutions  are  deleted,  the  empty  slots  in  the 
population  are  then  filled  by  mutating  and  crossing  copies  of  the 
non-dominated  solutions.  Tournament  selection  is  used  to  select 
parents  from  the  front  for  these  operations.  After  the  last  genera¬ 
tion,  age  is  discarded  when  computing  members  of  the  Pareto  front, 
since  the  goal  is  to  use  only  small,  accurate  and  cost-effective  solu¬ 
tions  for  prediction,  regardless  of  their  age. 

3.3  Non-hierarchical  GP 

A naive approach to cost-sensitive modeling using GP would be
to evolve individual trees that add conditional and comparative
operators to the base set of operators, and allow the tree to poll the
values of all sensors if desired, as shown in Figure 1.A. In this way,
different parts of the solution tree will be visited depending on the
current values of the sensors. Successful solutions may evolve which
only visit nodes containing references to expensive sensors (which
are then polled) if less expensive sensors report certain combinations
of values that signal these sensors are unlikely to predict well
given the current circumstances. $e_{t_j}^{train}$ and
$c_{t_j}^{train} \times age \times size$ are employed as the two main
objectives in the optimization process.

Figure 1.B shows a hypothetical example of a GP solution $t_j$
that has evolved to encode a useful conditional. In this example,
an inexpensive sensor $s_1$ is first polled. If its reported value
$r_1^{(t)}$ is below some threshold, the reading of a more expensive
sensor $s_2$ is used instead. It is assumed here that $s_1$ tends to make
poor predictions of the outcome if its reading is below 1.43. If this
threshold is exceeded, $r_1^{(t)}$ is then used to predict the outcome.
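
The conditional of Figure 1.B can be written out as a small sketch. The threshold 1.43 comes from the text; the `SensorBank` helper, the sensor costs, and the use of the raw reading as the prediction are illustrative assumptions, not the evolved tree itself.

```python
class SensorBank:
    """Polls sensors lazily and charges each sensor at most once per sample."""

    def __init__(self, read_fns, costs):
        self.read_fns = read_fns
        self.costs = costs
        self.cache = {}
        self.spent = 0.0

    def poll(self, name):
        if name not in self.cache:
            self.cache[name] = self.read_fns[name]()
            self.spent += self.costs[name]
        return self.cache[name]


THRESHOLD = 1.43  # value quoted in the text


def conditional_solution(bank):
    r1 = bank.poll("s1")        # the inexpensive sensor is always polled
    if r1 < THRESHOLD:          # s1 is assumed unreliable below the threshold
        return bank.poll("s2")  # fall back to the expensive sensor
    return r1


def run(s1_value, s2_value):
    # Hypothetical costs: s1 costs 1.0, s2 costs 10.0
    bank = SensorBank({"s1": lambda: s1_value, "s2": lambda: s2_value},
                      {"s1": 1.0, "s2": 10.0})
    return conditional_solution(bank), bank.spent
```

Calling `run(2.0, 7.0)` never touches s2, so only the cost of s1 is incurred; `run(1.0, 7.0)` pays for both sensors.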

Conditional operators should, indirectly, encode the differential
effects on the available sensors, and the relative costs of those
sensors. Note that this is possible even if GP does not have direct
access to these differential effects and costs, as they are indirectly
reflected in the errors and costs incurred when each solution is
evaluated. This issue is worth mentioning in that these effects are
complex, non-linear and noisy, and even field experts cannot define
them precisely.

Figure 1: A) Non-hierarchical framework. B) A non-hierarchical
sample solution.

Figure 2: Hierarchical framework.

3.4  Hierarchical  GP 

An  alternative  approach  to  reconciling  prediction  error  and  pre¬ 
diction  cost  is  to  build  a  hierarchy  of  models:  models  in  the  lower 
layers  only  have  access  to  inexpensive  sensors,  while  models  in  the 
upper  layers  have  access  to  a  greater  subset  of  the  sensors,  includ¬ 
ing  more  expensive  ones.  When  deployed,  the  overall  model  re¬ 
turns  a  prediction  from  a  lower  layer  if  the  inexpensive  sensors  are 
confident  of  their  combined  prediction.  If  they  are  not,  predictions 
are  drawn  from  a  higher  layer. 

Briefly,  constructing  such  a  model  proceeds  in  two  phases: 

1.  First,  build  a  set  of  k  layers,  one  for  each  sensor  modality.
For  each  layer  i,  run  GP  to  find  a  set  of  accurate  and
low-cost  solutions  that  use  one  or  more  sensors  from  the  set
$s_1, s_2, \ldots, s_i$.

2.  Define  conditions  which  determine  which  layer  should  be  al¬ 
lowed  to  provide  the  prediction,  given  the  current  environ¬ 
mental  conditions. 

Figure 2 illustrates what such a hierarchical model looks like.
At the outset of attempting to provide a prediction for the current
environmental conditions, the models stored in the lowest layer are
evaluated, which only have access to the least expensive sensor $s_1$.
If the certainty of their combined predictions is acceptable, the
combined prediction of these models is returned. Otherwise, the
models at the next layer are evaluated, which have access to $s_1$ and
the next least expensive sensor $s_2$. If these models are acceptably
confident in the prediction, their combined prediction is returned;
otherwise, the solutions at the next layer are evaluated, and so on.
If the top layer is reached, the combined predictions of the models
found there are returned as the overall prediction, regardless of their
level of certainty. The incremental construction of these models is
described next.

Starting with the least expensive sensor $s_1$, GP is used to find
the best models for converting $r_1^{(t)}$ to $g^{(t)}$. When GP terminates,
the final non-dominated solutions are then organized as a group
named layer $L_1$. The same process is repeated for $s_2$, except that,
since $s_1$ is already polled in $L_1$, it may be incorporated
into models during evolution without incurring an extra cost for the
solution tree that makes use of it. Similarly, for each sensor $s_i$,
a separate GP run is performed with sensors $s_1$ to $s_i$ available as
input to construct layer $L_i$. These layers are then organized in a
hierarchical fashion. The order of layers is based on the cost of
the most expensive sensor they represent, from $L_1$ to $L_k$.
Suppose each layer $L_i$ consists of $n_i$ solutions and the $j$th solution
$t_j$ in $L_i$ is denoted as $t_{ij}$. Let $p_{t_{ij}}^{(t)}$ denote the
prediction of $g^{(t)}$ that $t_{ij}$ provides. Then, the final prediction
of layer $L_i$ for $g^{(t)}$ is

$$p_{L_i}^{(t)} = \frac{1}{n_i} \sum_{j=1}^{n_i} p_{t_{ij}}^{(t)}$$

The error that corresponds to $p_{L_i}^{(t)}$ is

$$e_{L_i}^{(t)} = \left| p_{L_i}^{(t)} - g^{(t)} \right|$$

In the second phase, a conditional must be formulated to determine
whether the current layer should return its prediction, or
whether the burden of prediction should be passed up to the next
layer. One common method for measuring the confidence of an
ensemble of models is to compute the variance in their predictions
[21]:  if  variance  is  low,  and  those  models  are  sufficiently  indepen¬ 
dent  of  one  another,  there  is  a  greater  likelihood  that  their  combined 
predictions  can  be  trusted.  If  variance  is  high,  this  is  likely  the  re¬ 
sult  of  differing  assumptions  encoded  in  the  models,  which  cannot 
all  be  true  reflections  of  the  hidden  relationship  being  modeled. 
Note  the  assumption  here  that  the  models  are  relatively  indepen¬ 
dent:  a  set  of  identical  models  will  never  exhibit  a  variance  in  their 
predictions,  regardless  of  how  accurate  the  individual  models  are. 
We  can  be  somewhat  confident  of  the  independence  of  our  models, 
as  they  are  produced  by  the  AFPO  algorithm:  models  with  differing 
ages  are  likely  to  arrive  on  the  final  Pareto  front  used  to  build  each 
layer,  and  such  differently-aged  genomes  are  likely  to  be  somewhat 
independent  because  of  their  different  genetic  origins. 

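Formally: Let $p_{L_i}^{train(t)}$ and $e_{L_i}^{train(t)}$ denote $p_{L_i}^{(t)}$
and $e_{L_i}^{(t)}$ using $r^{(t)}$ on $d^{train}$, respectively. Similarly,
$p_{L_i}^{test(t)}$ and $e_{L_i}^{test(t)}$ respectively denote $p_{L_i}^{(t)}$
and $e_{L_i}^{(t)}$ using $r^{(t)}$ on $d^{test}$. Moreover, assume
$v_{L_i}^{train(t)}$ and $v_{L_i}^{test(t)}$ are the variances of all
$p_{t_{ij}}^{(t)}$s on $d^{train}$ and $d^{test}$. Also, $v_{L_i}^{train}$ denotes
$v_{L_i}^{train(t)}$ averaged over all the samplings in $d^{train}$.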
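
The per-layer quantities defined above can be sketched in a few lines. This is an assumption-laden illustration: models are plain callables on a list of readings, the layer prediction is taken to be the mean of its members' predictions, and the population variance is used as the disagreement measure. All function names are hypothetical.

```python
from statistics import mean, pvariance


def member_predictions(models, readings):
    return [model(readings) for model in models]


def layer_prediction(models, readings):
    # Assumed combination rule: the mean of the member predictions.
    return mean(member_predictions(models, readings))


def layer_error(models, readings, g):
    # Absolute difference between the layer prediction and the true outcome g.
    return abs(layer_prediction(models, readings) - g)


def layer_variance(models, readings):
    # Disagreement among members for one sample (population variance).
    return pvariance(member_predictions(models, readings))


def mean_training_variance(models, training_readings):
    # Per-sample variance averaged over all training samples.
    return mean(layer_variance(models, r) for r in training_readings)
```

A layer made of identical models would always report zero variance here, which is why the independence of the ensemble members matters for this measure.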

To determine whether the burden of prediction should remain
with the current layer or be passed off to a higher layer, we measure
the difference in prediction variance between the models when
presented with the training data or with the testing data, i.e., the
current environmental conditions ($v_{L_i}^{test(t)}$). When
$v_{L_i}^{test(t)}$ is almost the same as $v_{L_i}^{train}$, there is a high
probability that $e_{L_i}^{test(t)}$ is an approximation of
$e_{L_i}^{train(t)}$, and we can be relatively confident that these
models will yield a good collective prediction of $g^{(t)}$. When the
variance of test data prediction is significantly higher than the
variance of prediction on the training data, this signals that the
solutions in that layer are exhibiting increased disagreement regarding
the current environmental conditions. This could be due to the fact
that a specific sensor is not physically able to predict under the
current conditions, or the solutions have not been trained for the
current situation. In such an eventuality it would be advantageous to
switch to the next layer, in the hope that its models will exhibit more
confidence in their ability to predict the current conditions. In this
paper, the variance is considered as a proxy for entropy, but any other
entropy-related metric could be used instead. Figure 3 illustrates how
this intuition is encoded into the switching condition in the hierarchy
of layers.

Table 1: Available sensors and their features.

Figure 3: Using the difference between training data prediction
variance and test data prediction variance as the condition for
switching between model layers.

By  considering  the  amount  of  difference  between  prediction  vari¬ 
ance  on  training  and  testing  data,  we  can  dynamically  tune  how 
conservative  or  liberal  the  overall  hierarchical  model  is:  if  little  dif¬ 
ference  is  tolerated,  the  burden  of  prediction  will  often  be  passed 
to  higher  layers,  resulting  in  expensive  yet  accurate  predictions; 
if  much  difference  is  tolerated,  lower  levels  will  tend  to  predict, 
resulting  in  less  expensive  and  less  accurate  predictions.  The  ad¬ 
vantage  of  this  approach  is  that  the  amount  of  tolerance  could  be 
dynamically  tuned  based  on  the  current  available  budget  for  sens¬ 
ing. 

For  example,  for  larger  budgets,  more  cost  could  be  expended  in 
order  to  obtain  more  accurate  results.  In  this  regard,  the  tolerance  of 
the  difference  between  variances  could  be  decreased,  transferring 
the  burden  of  prediction  to  higher  layers.  Similarly  for  small  bud¬ 
gets,  the  tolerance  would  be  increased.  Through  this  adjustment, 
more  disagreement  would  be  tolerated  and  less  accurate  predictions 
would be obtained for lower cost. To implement this dynamic tuning
given a fluctuating budget, a tolerance parameter $r \in [0, 1]$ is
defined, reflecting the tolerance of disagreement between the
solutions of a given layer. Equation (1) demonstrates how this
parameter is used to determine which level should be activated for
prediction.


$$p^{(t)} = \begin{cases} p_{L_{i+1}}^{(t)} & \text{if } v_{L_i}^{train} < |1 - r| \, v_{L_i}^{test(t)} \\ p_{L_i}^{(t)} & \text{otherwise} \end{cases} \qquad (1)$$

It should be noted that in the present work, the same value for r is
used at the interstices between each pair of layers. However,
different values for r could be employed between different layers to
enable the model to respond better to changes in the overall
available budget. The extreme cases occur when r = 0 or r = 1.
With r = 0, the switching condition holds only when the prediction
variance on the testing data is greater than the prediction
variance on the training data, which has a high probability of
occurring. Thus, the method tends to extract the predictions from the
solutions on the uppermost layer. With r = 1, the right-hand side of
the condition is zero, so the condition never holds and the first layer
always provides the prediction. Values greater than r = 1 are not
investigated in this work, but are possible. Beyond r = 1, a greater
r value increases the probability that the condition is true;
r = ∞ causes the condition to always be true, so the method always
collects predictions from the last layer.
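
The layer-switching rule can be sketched as a simple cascade. The sketch assumes the condition of Equation (1) is read as "pass the prediction up when $v_{L_i}^{train} < |1 - r| \, v_{L_i}^{test(t)}$", that a layer combines its members by averaging, and that all readings are available in one list; a deployment would poll the next sensor only after switching. Names and example values are hypothetical.

```python
from statistics import mean, pvariance


def should_switch(train_variance, test_variance, r):
    # Pass the prediction up when disagreement on the current sample is
    # large relative to the stored training variance.
    return train_variance < abs(1.0 - r) * test_variance


def hierarchical_predict(layers, train_variances, readings, r):
    """Return (prediction, index of the layer that answered)."""
    last = len(layers) - 1
    for i, models in enumerate(layers):
        predictions = [model(readings) for model in models]
        # The top layer always answers, regardless of its certainty.
        if i == last or not should_switch(train_variances[i],
                                          pvariance(predictions), r):
            return mean(predictions), i


# Two hypothetical layers: layer 0 sees readings[0], layer 1 also sees readings[1].
layers = [
    [lambda x: x[0], lambda x: x[0] + 2.0],
    [lambda x: x[1]],
]
train_variances = [0.5]  # one stored value per layer except the top one
readings = [1.0, 5.0]

print(hierarchical_predict(layers, train_variances, readings, r=0.0))
print(hierarchical_predict(layers, train_variances, readings, r=0.8))
```

With these values the two lower-layer models disagree (variance 1.0), so a tolerance of 0 passes the prediction to the upper layer, while a tolerance of 0.8 accepts the lower layer's answer.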

4.  RESULTS 

The proposed methods are evaluated over two sets of experiments,
using a synthesized dataset and ten actual datasets. This section
summarizes these datasets, experimental setups, and quantitative
results.

4.1  Synthesized  Data 

In this experiment, the proposed methods have been evaluated
on a synthetic system monitored by three different sensors. Table
1 shows these three sensors, their readings in relation to $g^{(t)}$, and
their cost.

To create the training and testing datasets, the coefficients
in the equations of the sensor relations, i.e., the $b_{ij}$s, were first
randomly selected in the range [0, 1]. Then, random numbers were
generated for $g^{(t)}$ and used to calculate the sensor readings based on
the given template and selected coefficients. The training and testing
dataset sizes were 150 and 50, respectively, and each experiment
was repeated 40 times.

Non-hierarchical setup. The population size is 100 and the population
is trained for 300 generations. The optimization process during the last
generation does not consider age as an objective, and the Pareto front
is selected using error and cost × size as two separate objectives.
After training, the knee of the non-dominated solutions is selected and
tested using the testing dataset.
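
The text does not say how the knee of the front is located. One common heuristic, sketched here purely as an assumption, picks the point farthest from the straight line joining the two extreme points of the front after min-max normalisation of both objectives. The function name and the example front are hypothetical.

```python
def knee_point(front):
    """front: list of (error, cost_x_size) pairs that are mutually non-dominated."""
    pts = sorted(front)
    if len(pts) < 3:
        return pts[0]  # too few points for a knee; fall back to lowest error
    (x0, y0), (x1, y1) = pts[0], pts[-1]
    dx = (x1 - x0) or 1.0
    dy = (y1 - y0) or 1.0

    def distance(p):
        # After normalisation the extremes map to (0, 0) and (1, 1); the
        # distance to the line joining them is proportional to |nx - ny|.
        nx = (p[0] - x0) / dx
        ny = (p[1] - y0) / dy
        return abs(nx - ny)

    return max(pts, key=distance)


front = [(0.1, 10.0), (0.2, 3.0), (0.5, 2.0), (0.9, 1.0)]
print(knee_point(front))
```

In this example the point (0.2, 3.0) gives up little accuracy relative to the lowest-error extreme while removing most of the cost, so it is the one selected.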

Hierarchical setup. The population size for each layer is 100
and each layer was trained for 100 generations, for the sake of
fairness in comparisons (three layers of 100 generations each match
the 300 generations of the non-hierarchical run). Similarly to the
non-hierarchical setup, during the last generation, age is not
considered in the Pareto optimization process, and non-dominated
solutions are selected based on error and cost × size as two separate
objectives. After training, for each layer $L_i$, the variance of the
solutions' output on training data, $v_{L_i}^{train}$, is computed and
stored as the threshold for switching to the next layer $L_{i+1}$. This
variance is not computed for the layer corresponding to the most
expensive sensor, i.e., $L_3$, since there are no more sensors to be
called. The experiment was repeated 40 times for each of the different
tolerance parameters r = 0.0, 0.1, 0.2, 0.4, 0.6, 0.8.

4.1.1  Results  on  Synthesized  Data 

Average error. The average error of the non-hierarchical method
is $e_{t_j}^{test}$, where $t_j$ is the final selected solution. The average
error of the hierarchical method is the average of $e_{L_i}^{test}$, where
$L_i$ is the last layer reached in the hierarchy during the sampling. As
can be seen in Figure 4, the largest difference in error occurs at
maximum tolerance, i.e., r = 0.8, where the error of the hierarchical
method is 1.34% higher than that of the non-hierarchical method. The
hierarchical method tends to achieve lower average error when the
tolerance parameter is r < 0.4. P-values obtained for different
tolerance parameters are presented in Table 2.A and show that r = 0.4 is
the boundary where the hierarchical method begins to outperform
the non-hierarchical method.

Figure 4: Average error on the test data for the non-hierarchical (NH)
and the hierarchical (H) methods with different tolerance parameters.
Statistical significance of these results is presented in Table 2.A.

Table 2: A) P-values considering error of the non-hierarchical
and the hierarchical methods with different tolerance parameters.
B) P-values considering cost of the non-hierarchical and
the hierarchical methods with different tolerance parameters.

Tolerance   A: P-value (error)   B: P-value (cost)
r = 0.8     0.013414             < 0.001
r = 0.6     0.046626             < 0.001
r = 0.4     0.635566             < 0.001
r = 0.2     0.001309             < 0.001
r = 0.1     < 0.001              < 0.001
r = 0.0     ≪ 0.001              < 0.001

Average cost. By considering $t_j$ as the final selected solution
in the non-hierarchical method, the average cost is $c_{t_j}^{test}$. The
average cost of the hierarchical method is the average of $c_{L_i}^{test}$,
when the last layer reached during the sampling is $L_i$. In order to
compare both methods and understand how much of the potential cost each
method uses, the cost of each method is represented as the percentage
of the cost of using all available sensors. Figure 5 shows that the
average cost of the hierarchical method is significantly lower than
that of the non-hierarchical method (at most 54.88% and at least 33.81%
lower cost). Table 2.B summarizes the p-values to show how
significantly the cost of the hierarchical method is lower than that of
the non-hierarchical method.

4.2  Actual  Data 

In  this  experiment,  ten  datasets  are  selected  from  the  UCI  database 
repository  [2]  based  on  the  number  of  instances  and  features  from 
the  regression  section.  Table  3  summarizes  these  datasets  and  their 
features. For these datasets, we treat the individual features as
individual sensors. Each experiment in this section was repeated 30
times.

Figure 5: The average cost on the test data for the non-hierarchical
and the hierarchical methods with different tolerance parameters.
Statistical significance of these results is presented in Table 2.B.

Table 3: Used UCI datasets.

DS No.  DS Name                           No. of Instances  No. of sensors  $g^{(t)}$ Average
DS1     Auto MPG                          398               7               23.51457
DS2     Housing                           506               13              22.53281
DS3     Forest Fires                      517               12              0.031663
DS4     Energy Efficiency                 768               8               22.3072
DS5     Concrete Compressive Strength     1030              8               35.81796
DS6     Solar Flare                       1389              9               0.300188
DS7     Airfoil Self-Noise                1503              5               124.8359
DS8     SkillCraft1 Master Table Dataset  3395              19              4.184094
DS9     Wine Quality                      4898              11              5.877909
DS10    Parkinson's Telemonitoring        5875              17              29.01894

Table 4: Value of $v_{r_i}^{(g)}$ for all the sensors of the Auto MPG dataset.

DS Name   s1      s2      s3      s4      s5      s6      s7
Auto MPG  0.1766  0.3175  0.3356  0.5951  0.6012  0.6467  0.6918

Table 5: Minimum and maximum amount of variance a sensor
accounts for, in each dataset.

DS No.  min $v_{r_i}^{(g)}$  max $v_{r_i}^{(g)}$
DS1     0.1766               0.6918
DS2     0.0307               0.5441
DS3     0.0002               0.2578
DS4     0.0076               0.7911
DS5     0.0112               0.2478
DS6     0.000                0.096
DS7     0.0157               0.1527
DS8     0.0005               0.4542
DS9     0.0001               0.1897
DS10    0.0037               0.0263

Table 6: Average error percentages and the corresponding P-values
for the hierarchical and the non-hierarchical methods.

DS No.  NH: error %  H: error %  P-value
DS1     20.85        25.81       < 0.001
DS2     25.90        28.93       < 0.001
DS3     126.49       202.12      0.565
DS4     29.19        36.70       < 0.001
DS5     35.29        39.68       0.393
DS6     110.63       111.08      0.223
DS7     0.00         0.00        0.082
DS8     37.59        28.65       0.194
DS9     10.79        10.67       0.197
DS10    32.11        29.88       0.423

In order to determine the accuracy of each sensor $s_i$ in predicting
$g^{(t)}$, the value of $v_{r_i}^{(g)}$ is calculated for each available
sensor of each dataset, using linear regression. The greater
$v_{r_i}^{(g)}$ is, the better that sensor can predict $g^{(t)}$. Table 4
summarizes the values of $v_{r_i}^{(g)}$ for all of the sensors of the
Auto MPG dataset, as an example. We define the cost of each sensor in
these datasets as $v_{r_i}^{(g)}$.
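
The per-sensor score used in this section as both an accuracy measure and a cost (the amount of variance in the outcome that a single sensor accounts for under linear regression) can be sketched as follows. The sketch assumes a univariate least-squares fit per sensor, for which the fraction of variance explained equals the squared Pearson correlation; the function name and sample values are hypothetical.

```python
def variance_explained(readings, outcomes):
    """R^2 of a univariate least-squares fit of outcomes on one sensor's readings."""
    n = len(readings)
    mean_r = sum(readings) / n
    mean_g = sum(outcomes) / n
    s_rr = sum((r - mean_r) ** 2 for r in readings)
    s_gg = sum((g - mean_g) ** 2 for g in outcomes)
    s_rg = sum((r - mean_r) * (g - mean_g) for r, g in zip(readings, outcomes))
    if s_rr == 0.0 or s_gg == 0.0:
        return 0.0  # a constant sensor (or constant outcome) explains nothing
    return (s_rg * s_rg) / (s_rr * s_gg)


# Hypothetical readings of two sensors against the same outcome
outcome = [2.0, 4.0, 6.0, 8.0]
informative = [1.0, 2.0, 3.0, 4.0]
uninformative = [3.0, 3.0, 3.0, 3.0]
print(variance_explained(informative, outcome))
print(variance_explained(uninformative, outcome))
```

Under this cost definition, the sensors that predict best are also the most expensive, which is what forces a trade-off between accuracy and cost on these datasets.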

Table 7: Average cost percentages and the corresponding P-values
for the hierarchical and the non-hierarchical methods.

DS No.  NH: cost %  H: cost %  P-value
DS1     38.89       12.33      0.022
DS2     23.18       1.26       < 0.001
DS3     6.83        15.81      < 0.001
DS4     32.90       4.18       < 0.001
DS5     53.63       28.63      0.004
DS6     0.00        0.98       0.040
DS7     11.58       7.35       0.005
DS8     2.55        0.00       < 0.001
DS9     0.02        0.00       < 0.001
DS10    20.62       3.88       0.009

Non-hierarchical setup. The population size is 200 and, for each
dataset with k features, the population is trained for 200 × k
generations.

Hierarchical setup. The population size for each layer is 200.
Similar to the synthesized data experiments, in order to equalize
search effort in both methods, each layer was trained for 200
generations. After training, a subset of the non-dominated solutions
with least error is selected and organized in the corresponding layer.
The cardinality of this subset is 2% of the population size. This
experiment was conducted for tolerance parameter r = 0.1. This value
is selected based on the results in Section 4.1 and will be discussed
in more detail in Section 5.1.

4.2.1  Results  on  Actual  Data 

Average error. The average errors for the non-hierarchical and
the hierarchical methods are $e_{t_j}^{test}$ and $e_{L_i}^{test}$
respectively, where $t_j$ is the final selected solution in the
non-hierarchical method and $L_i$ is the last layer reached during the
sampling in the hierarchical method. Table 6 summarizes the average
error of both methods on all the datasets as a percentage of error. It
can be seen that for some datasets, the average error of the
hierarchical method is higher than the average error of the
non-hierarchical method. However, the P-value for the two-tailed t-test
shows that generally, this difference is not significant. There are
three cases where the difference is significant, i.e., DS1, DS2 and DS4.

Average cost. Similar to Section 4.1.1, the average cost is represented
as the percentage of the maximum possible cost. Table 7 summarizes
the percentage of the average cost each method uses for prediction.
The cost of the hierarchical method is significantly lower in
all cases except for DS3 and DS6.

5.  DISCUSSION 

Our results in all experiments suggest that the hierarchical method
is better at balancing cost and accuracy than the non-hierarchical
approach.  We  believe  this  is  because  meaningful  sensor  control 
conditions  for  managing  cost  are  complex  and  require  consider¬ 
able  computational  effort  to  be  discovered.  Using  hand-tuned  pre¬ 
diction  uncertainty  to  drive  sensor  control  is  more  effective.  Fur¬ 
thermore,  our  results  show  that  the  latter  approach  better  supports 
dynamic  adaptation  to  changes  in  available  energy,  through  modu¬ 
lation  of  tolerance.  The  non-hierarchical  approach  cannot  adapt  to 
such  changes  without  retraining  from  scratch,  or  aggressive  online 
learning.  As  mentioned  in  Section  3.2,  in  these  experiments  a  basic 
genetic  programming  approach  was  deployed.  We  anticipate  that  if 
we  were  to  use  a  more  powerful  underlying  GP  approach,  the  error 
of  both  hierarchical  and  non-hierarchical  models  would  be  reduced. 

In  the  remainder  of  this  Section  we  discuss  results  as  they  pertain 
specifically  to  experiments  with  synthesized  and  actual  data. 

5.1  Synthesized  Data 

Average error. As can be seen in Figure 4, the hierarchical
method achieved significantly better accuracy than the non-hierarchical
method for r < 0.4. In general, results show that higher tolerance
allows the algorithm to accept more uncertainty in the prediction
and rely on less expensive sensors which are less accurate. This
avoids the use of more expensive sensors, but causes average error
to rise. A tolerance of r = 0.4 is apparently the threshold beyond
which average error in the hierarchical method exceeds that of the
non-hierarchical method.

Average cost. Results reported in Figure 5 show that the
hierarchical method significantly outperforms the non-hierarchical
method with regard to cost on this dataset, even when tolerance is
low. This suggests that a control policy equivalent to using variance
in ensemble predictions as a proxy for prediction uncertainty is not
easy for GP to learn on its own, and that this variance serves as a
good mechanism for control. Results in Figures 4 and 5 suggest that
r = 0.1 is a "sweet spot" for balancing cost and accuracy, though the
value could be increased or decreased if greater frugality or accuracy
were needed, respectively.

5.2  Actual  Data 

For  testing  on  actual  data,  we  fixed  r  =  0.1  due  to  results  on 
synthetic  data  demonstrating  a  nice  balance  between  cost  and  ac¬ 
curacy  with  this  tolerance  level. 

Average error. Table 6 shows that the average errors of the
hierarchical and the non-hierarchical methods were not significantly
different, except for datasets DS1, DS2 and DS4, where the latter
method achieves better prediction accuracy. This is probably due to
the characteristics of these datasets, where the difference between
the least values of $v_{r_i}^{(g)}$ and the greatest ones is large.
The majority of sensors in these datasets are not informative but
have low costs, and the remaining sensors are informative enough
but come with much higher costs. Thus, lower levels of the hierarchy
"struggle" compared to upper ones in terms of accuracy. Nevertheless,
accuracy with the hierarchical method is still competitive
even in these cases, and cost reduction is significant. Also, it can be
seen that as the size of the datasets grows, the difference between
the error rate of the non-hierarchical and the hierarchical methods
decreases, and in the three largest datasets the hierarchical method
also achieves lower error rates.

Average cost. The hierarchical method achieved significantly
lower cost than the non-hierarchical method on all the real-world
datasets, as shown in Table 7, except for DS3 and DS6. As shown
in Table 5, in these two datasets, just a small subset of
sensors is relatively informative. Since the tolerance parameter
for the hierarchical method is low, the hierarchical method employs
more informative sensors. Taken together, results shown in Tables
6 and 7 clearly indicate an advantage of the hierarchical method for
balancing cost and accuracy.

6.  CONCLUSION  AND  FUTURE  WORK 

All  resource-constrained  sensor  systems  have  to  face  a  trade-off
between  measurement  accuracy  and  the  cost  of  sensor  sampling.  In 
networks  supporting  multiple  sensor  types,  it  is  therefore  desirable 
to  develop  cost-sensitive  control  algorithms  that  sample  more  ex¬ 
pensive  sensors  only  when  necessary.  In  this  paper,  a  hierarchical 
method  is  proposed  where  GP  solutions  are  sorted  in  a  hierarchy  of 
layers  based  on  the  cost  of  the  sensors  they  use.  Switching  to  the 
next  more  expensive  layer  takes  place  only  if  the  prediction  vari¬ 
ance  indicates  uncertainty  at  lower  layers.  We  compare  this  method 
to  a  non-hierarchical  GP  method  where  cost  is  treated  as  an  addi¬ 
tional  optimization  objective  in  fitness  selection.  In  experiments 
using  a  synthesized  dataset  and  ten  real  datasets,  the  hierarchical 
method  is  shown  to  have  significantly  lower  prediction  costs  than 
the  non-hierarchical  method.  As  the  datasets  grow  bigger  and  more 
complex,  competitive  and  sometimes  lower  error  rates  are  achieved 
by  the  hierarchical  method.  Future  work  includes  consideration  of 
how  to  dynamically  tune  the  balance  of  cost  and  accuracy  based  on 
available  energy  and  budget.  Other  directions  for  future  work  in¬ 
clude  methods  for  online  learning  to  support  adaptation  of  control 
algorithms  to  particular  deployments,  and  application  of  hierarchi¬ 
cal  control  algorithms  in  real  resource  constrained  sensor  system 
deployments.

Infrastructure for the automatic collection of single-point measurements of snow water equivalent (SWE)
is  well-established.  However,  because  SWE  varies  significantly  over  space,  the  estimation  of  SWE  at  the 
catchment  scale  based  on  a  single-point  measurement  is  error-prone.  We  propose  low-cost,  lightweight 
methods  for  near-real-time  estimation  of  mean  catchment-wide  SWE  using  existing  infrastructure,  wire¬ 
less  sensor  networks,  and  machine  learning  algorithms.  Because  snowpack  distribution  is  highly  nonlin¬ 
ear,  we  focus  on  Genetic  Programming  (GP),  a  nonlinear,  white-box,  inductive  machine  learning 
algorithm.  Because  we  did  not  have  access  to  near-real-time  catchment-scale  SWE  data,  we  used  avail¬ 
able  data  as  ground  truth  for  machine  learning  in  a  set  of  experiments  that  are  successive  approximations 
of  our  goal  of  catchment-wide  SWE  estimation.  First,  we  used  a  history  of  maritime  snowpack  data  col¬ 
lected  by  manual  snow  courses.  Second,  we  used  distributed  snow  depth  (HS)  data  collected  automatical¬ 
ly  by  wireless  sensor  networks.  We  compared  the  performance  of  GP  against  linear  regression  (LR),  binary 
regression  trees  (BT),  and  a  widely  used  basic  method  (BM)  that  naively  assumes  non-variable  snowpack. 
In  the  first  experiment  set,  GP  and  LR  models  predicted  SWE  with  lower  error  than  BM.  In  the  second 
experiment  set,  GP  had  lower  error  than  LR,  but  outperformed  BT  only  when  we  applied  a  technique  that 
specifically  mitigated  the  possibility  of  over-fitting. 

©  2015  Elsevier  B.V.  All  rights  reserved. 


1.  Introduction 

There  has  been  extensive  research  on  techniques  for  measuring 
and modeling snow because it affects many hydrological, atmospheric,
and biological processes (Tappeiner et al., 2001). The accurate
estimation of snow water equivalent at the catchment scale is
useful  in  many  applications,  including  agricultural  planning, 
metropolitan  use,  flood  risk  evaluation,  planning  of  hydropower 
production  potential,  weather  forecasting,  and  climate  monitoring 
(Marofi  et  al.,  2011;  Schmucki  et  al.,  2014).  More  than  1/6  of  people 
globally  depend  on  seasonal  snow  or  glaciers  for  water  supplies 
(Bales  et  al.,  2006),  and  in  the  western  United  States  the  majority 
of  surface  water  resources  is  derived  from  snowmelt  (Serreze 
et al., 1999). However, snow has declined across much of the US over
the  last  half-century  (Pierce  et  al.,  2008).  The  current  severe  drought 
in  California,  with  record  low  snowpack  measurements  over  three 
years,  threatens  water  supplies  throughout  the  state  (Boxalla, 
2014)  and  highlights  the  importance  of  snowpack  research.  Snow 
both influences climate and responds directly to climate change
(Engeset et al., 2004). While climate change warrants increased
snowpack  monitoring,  existing  techniques  perform  poorly  under 
extreme  climatic  conditions  (Molotch  et  al.,  2005;  Balk  and  Elder, 
2000),  and  it  has  been  argued  that  the  stationarity  of  hydrological 
processes  can  no  longer  be  assumed  (Milly  et  al.,  2008). 
Furthermore,  high  costs  of  data  gathering  constrain  the  temporal 
and  spatial  granularity  of  estimation  methods.  New  techniques 
are  needed. 

We  propose  new  low-cost  techniques  for  estimating  catch¬ 
ment-wide  snow  water  equivalent  using  machine  learning  algo¬ 
rithms,  especially  genetic  programming.  These  algorithms  use 
data  gathered  from  existing  sensor  infrastructure,  and  possibly 
short-term  deployments  of  wireless  sensor  networks.  The 
manipulation  of  large  data  sets  in  order  to  gain  insight  into  snow 
accumulation,  melt,  and  runoff  has  been  highlighted  as  a  necessary 
next  step  in  mountain  hydrology  (Dozier,  2011).  The  long-term, 
overarching goal of our research project is to achieve better near-
real-time (NRT) estimation of SWE at the catchment scale. By
NRT,  we  mean  automated  reporting  at  fine-grained  timescales, 
for  example  hourly.  By  better,  we  mean  more  accurate  estimation 
without  significantly  increased  infrastructure  cost.  Our  strategy  is 
to  generate  snow  telemetry  datasets  using  short-term,  low-cost 
field campaigns that can be used by machine learning algorithms
to generate snowpack models. Following field campaigns and the
termination  of  associated  measurement  techniques,  these  models 
can  be  used  for  NRT  SWE  estimations  with  no  new  instrumentation 
overhead. 

The key idea behind our approach is that machine learning
models are able to induce relationships between input parameters
and an output value, if such relationships exist, from the ground
truth data provided. The machine learning method we emphasize is
genetic  programming  (GP),  which  generates  equations  relating  a 
dependent  variable  to  a  set  of  independent  variables. 

In  our  case,  we  argue  that  if  we  obtain  multiple  years  of  “true” 
average  SWE  for  a  catchment,  machine  learning  will  be  able  to 
induce  a  meaningful  mathematical  relation  between  telemetry, 
such  as  proximal  snow  pillow  reading(s),  and  true  average  SWE. 
Then,  in  years  when  true  average  SWE  is  not  available,  inputs  such 
as  snow  pillow  readings  can  be  translated  into  average  SWE  esti¬ 
mates  for  the  catchment.  This  approach  assumes  interannual  conti¬ 
nuity  in  snow  distributions  over  a  catchment,  which  has  been 
demonstrated by previous research (Scipion et al., 2013; Tappeiner
et al., 2001; Schirmer et al., 2011). Because accurate measurements
of  mean  catchment  SWE  are  generally  unavailable  at  this  time,  we 
use  snow  course  and  wireless  sensor  network  data  as  proxies  for  true 
average  SWE  to  serve  as  ground  truth  for  machine  learning. 

Thus,  the  ideal  we  aim  for  is  a  generally  applicable  technique  for 
inducing  models  that  take  as  input  parameters  existing  infrastruc¬ 
ture  NRT  telemetry,  such  as  snow  pillow  readings,  meteorological 
data,  and  date/time  information,  and  output  measurements  of 
SWE  at  those  locations.  This  would  allow  more  accurate  SWE  esti¬ 
mation  to  be  provided  without  additional  cost  beyond  that  of  the 
initial  field  campaign  for  obtaining  a  ground  truth  dataset  (Fig.  1). 

Several  theoretical  and  practical  challenges  exist  on  the  way  to 
achieving  this  goal.  The  purpose  of  this  paper  is  to  address  them 
and  make  progress  in  three  particular  ways. 

First,  we  explore  the  issue  of  what  sort  of  machine  learning 
approaches  are  best  in  this  context.  In  general,  we  argue  that  tech¬ 
niques  that  are  able  to  model  nonlinear  relationships  are  needed 
due  to  the  known  nonlinear  nature  of  snow  distribution  in  alpine 
environments  (Tappeiner  et  al.,  2001;  Marofi  et  al.,  2011).  We  also 
argue  that  so-called  white-box  tools  are  best,  since  these  can  pro¬ 
vide  physical  insights  for  scientists  (Schmidt  et  al.,  2011). 
Furthermore,  we  emphasize  resiliency  against  over-fitting,  which 
is  especially  important  given  that  the  datasets  available  for 
machine  learning  may  be  relatively  small. 

Second,  we  investigate  what  sort  of  input  parameters  should  be 
used  by  SWE  estimation  models,  especially  in  light  of  practical  con¬ 
cerns,  i.e.  available  telemetry  and  datasets.  In  fact,  availability  of 
data  is  a  key  issue  in  this  effort,  and  defines  what  is  possible.  We 
acknowledge  the  importance  of  terrain  effects  in  determining 
snowpack distribution, influencing both accumulation and ablation
patterns (Winstral et al., 2013; Fassnacht et al., 2003; Marks
et al., 1999). However, because all snow sensors and courses are
on  flat  or  nearly  flat  ground,  we  did  not  include  topographic  data 
as  explicit  inputs  to  our  models.  We  emphasize  the  flexibility  of 
inductive  machine  learning,  which  can  accommodate  arbitrary 
new  input  modalities.  Only  those  that  are  predictive  of  the  depen¬ 
dent  variable  of  interest  will  be  significantly  incorporated  into  the 
generated  models.  In  this  paper  we  focus  on  several  potential  snow 
telemetry  and  meteorological  inputs  in  order  to  demonstrate  the 
applicability  of  our  techniques  to  catchment-scale  SWE  estimation, 
while  considering  the  potential  for  future  work  to  explore  other 
inputs  such  as  topographic  data. 

Third,  we  grapple  with  the  issue  of  ground-truth  for  catchment- 
scale  SWE  and  usable  datasets.  Constraints  on  our  goal  were 
imposed  by  the  availability  of  snowpack  data.  We  are  not  aware 
of  catchment-wide  SWE  datasets  with  sufficiently  fine  time  granu¬ 
larity  to  support  our  ideal  scenario.  Although  datasets  such  as 
those  provided  by  the  Cold  Land  Processes  Field  Experiment 
(National  Snow  &  Ice  Data  Center,  2014)  and  numerous  others  pro¬ 
vide  catchment-scale  snowpack  measurements,  their  time  granu¬ 
larity  is  on  the  order  of  several  months  at  least.  Airborne 
techniques  in  general  are  cost-prohibitive  for  real-time  reporting 
(Bühler et al., 2011). Although satellites are used to measure
snow-covered  area  and  albedo  (Dozier  and  Painter,  2004),  satellite 
retrievals  of  SWE  are  not  feasible.  Manual  snow  courses  provide 
better  temporal  resolution  than  airborne  methods  (e.g.  biweekly) 
but  at  low  spatial  resolution:  snow  courses  measure  SWE  at  a  sin¬ 
gle  location.  We  highlight  the  Snowcloud  wireless  sensor  network, 
which measures HS (an effective predictor of SWE) in NRT (e.g.
hourly)  at  multiple  locations  distributed  over  an  area  of  interest. 
However,  this  technology  is  new,  and  available  data  collected  by 
Snowcloud  deployments  is  limited. 

2.  Background  and  contributions 

Here  we  briefly  define  and  summarize  the  machine  learning 
methods  used  in  this  work.  These  techniques  are  described  in  more 
detail,  with  special  emphasis  on  GP,  in  Section  4.  The  basic  method 
(BM)  assumes  the  spatial  homogeneity  of  SWE.  It  naively  estimates 
mean  catchment-wide  SWE  to  be  the  same  as  the  single-point  SWE 
measurement  taken  at  a  snow  pillow.  Linear  regression  (LR)  fits  a 
least-squares  linear  model  to  training  data  (Hastie  et  al.,  2009). 
The  prediction  is  a  weighted  linear  combination  of  the  input  vari¬ 
ables.  Binary  regression  trees  (BT)  are  nonlinear  models  which  are 
generated  using  training  data  (Hastie  et  al.,  2009).  A  BT  model  par¬ 
titions  a  set  of  predictions  according  to  the  input  variables  such 
that  a  given  set  of  input  values  results  in  a  specific  prediction. 
Genetic  Programming  (GP)  is  a  symbolic  regression  algorithm  that 
uses  training  data  to  iteratively  improve  a  population  of  nonlinear 
models  through  a  combination  of  stochastic  variation  and  perfor¬ 
mance-based  selection  (Koza,  1992). 
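To make the difference between BM and LR concrete, the sketch below compares the two on a small invented dataset. The numbers are hypothetical, the helper names `fit_linear` and `rmse` are introduced only for illustration, and errors are computed on the training data purely to keep the example short.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def rmse(predicted, truth):
    """Root mean squared error between two equal-length sequences."""
    return (sum((p - t) ** 2 for p, t in zip(predicted, truth)) / len(truth)) ** 0.5

# Hypothetical values (cm of SWE): snow pillow readings and mean catchment SWE.
pillow = [10.0, 20.0, 30.0, 40.0]
catchment = [7.0, 12.0, 17.0, 22.0]  # invented so that catchment = 2 + 0.5 * pillow

bm_prediction = pillow  # BM: mean catchment SWE is assumed equal to the pillow reading
intercept, slope = fit_linear(pillow, catchment)
lr_prediction = [intercept + slope * x for x in pillow]

bm_error = rmse(bm_prediction, catchment)
lr_error = rmse(lr_prediction, catchment)
```

Because the invented catchment values are an exact linear function of the pillow readings, LR recovers them with zero error, while BM carries the full offset between the pillow and the catchment mean.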


Fig.  1.  First,  the  Snowcloud  WSN  is  deployed  in  an  area  near  a  snow  pillow.  Next,  data  generated  by  Snowcloud,  by  the  pillow,  and  potentially  other  sources,  are  used  by 
machine  learning  to  generate  a  model  of  snowpack  distribution.  Finally,  after  Snowcloud  has  been  removed,  the  model  is  used  to  estimate  snow  levels  in  the  area  where 
Snowcloud had been deployed.

In our ideal situation we would use a large set of accurate measurements
of mean catchment SWE as ground truth to train and
evaluate  models  that  predict  mean  catchment  SWE  in  NRT. 
However,  the  only  SWE  measurements  available  at  this  spatial 
scale  are  generated  by  airborne  techniques  with  time  resolutions 
that  are  insufficient  for  machine  learning  (e.g.  twice  per  year). 
Because  machine  learning  needs  a  large  number  of  samples  for 
model  training  and  because  we  want  to  predict  SWE  in  near-real- 
time,  we  required  much  more  frequent  measurements.  We  there¬ 
fore  developed  a  series  of  experiments  using  available  snowpack 
data  in  lieu  of  NRT  catchment-scale  SWE  measurements  to  explore 
successive  approximations  of  our  ideal  scenario.  Approximations 
of average catchment SWE, obtained via snow courses and distributed
ground-based sensor readings, serve as ground truth for
machine  learning  in  our  experiments.  Implicit  in  our  work  is  the 
importance  of  new  methods  for  obtaining  NRT  catchment-scale 
SWE  ground-truthing  via  low-cost  distributed  sensor  networks. 
As  data  from  NASA's  Airborne  Snow  Observatory  (NASA  Airborne 
Snow  Observatory,  2015)  become  available  for  a  range  of  years, 
they  will  provide  an  ideal  data  set  for  our  approach. 

First,  we  used  snow  course  measurements,  which  involve  the 
manual  collection  of  SWE  and/or  HS  at  a  single  location,  as  a  proxy 
for  catchment-wide  SWE.  Although  snow  courses  do  not  directly 
measure  snowpack  distribution  at  the  catchment  scale,  they  are 
likely  to  provide  measurements  that  are  closer  to  mean  catchment 
SWE than snow pillow measurements are. Snow courses take multiple
measurements over approximately 200 m, so they involve a
much  larger  sample  size  than  the  single-point  measurements  of 
snow  pillows.  Furthermore,  pillow  under-measurement  or  over¬ 
measurement  errors  may  occur  when  the  base  of  the  snow  cover 
is  at  melting  temperature  (Johnson  and  Marks,  2004).  Thus,  we 
used  snow  course  data  as  a  first  approximation  of  mean  catchment 
SWE  to  provide  ground-truth  data  for  machine  learning.  We  gener¬ 
ated  models  that  use  readily  available  information  such  as 
meteorological  telemetry  and  snow  pillow  measurements  as  input 
variables. This approach, which is explored in Experiment Set I,
would  allow  for  shorter  or  less  frequent  snow  courses  or  for  their 
discontinuation  and,  because  it  uses  previously  collected  data, 
incurs  no  data  gathering  costs. 

Second,  we  used  HS  data  collected  by  the  Snowcloud  (Skalka 
and  Frolik,  2014)  wireless  sensor  network  (WSN)  at  sites  in 
Norway  and  California,  each  for  only  one  snow  season,  as  a  proxy 
for  catchment-wide  SWE  data.  Snowcloud  is  a  WSN-based  data 
gathering  system  for  snow  hydrology,  notable  for  its  low-cost 
and  ease  of  deployment,  developed  and  operated  by  the 
University  of  Vermont.  A  network  of  light-weight  sensor  towers 
(nodes)  is  deployed  over  an  area  of  interest  for  a  short-term 
field  campaign  to  collect  spatially  distributed  measurements  of 
relevant  meteorological  processes  (Fig.  3).  In  addition  to  HS, 
Snowcloud  measures  air  temperature,  soil  temperature,  and  solar 
radiation.  Mesh  wireless  communication  allows  data  from  the 
entire  network  to  be  collected  wirelessly  by  communication  with 
a  single  node. 

We  used  measurements  collected  from  Snowcloud  over  the 
course  of  a  single  snow  season  to  generate  ground-truth  estimates 
for  model-training.  Note  that  it  could  be  desirable  to  collect  data 
over  multiple  seasons  as  models  trained  on  multi-year  data  may 
be more robust against inter-annual variations in snowpack
distribution.  Once  a  model  has  been  obtained,  the  WSN  may  be 
recovered  for  re-deployment  at  another  site.  Unlike  pillows  and 
snow  courses,  Snowcloud  collects  NRT  data  from  multiple 
locations,  potentially  capturing  more  of  the  variability  of  snowpack 
distribution  than  is  possible  with  single-location  measurements. 
Thus,  we  use  Snowcloud  data  as  a  second  approximation  of 
catchment  mean  SWE  to  provide  ground-truth  data  for  machine 
learning.  This  technique  is  explored  in  Experiment  Set  II. 
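One simple way to turn distributed node readings into a single ground-truth value per time step is to average the available HS measurements across the network. The sketch below assumes a hypothetical data layout (timestamps mapping to per-node readings, with `None` marking a missing value); it does not describe the actual Snowcloud data format.

```python
def network_mean_hs(readings):
    """Average snow depth across nodes at each timestamp, skipping missing values.

    readings: dict mapping timestamp -> dict of node id -> HS in cm (None if missing).
    Returns a dict mapping timestamp -> mean HS; timestamps without any valid
    reading are omitted.
    """
    means = {}
    for timestamp, by_node in readings.items():
        valid = [hs for hs in by_node.values() if hs is not None]
        if valid:
            means[timestamp] = sum(valid) / len(valid)
    return means

# Hypothetical readings from three nodes at two time steps.
readings = {
    "t0": {"node1": 100.0, "node2": 120.0, "node3": None},
    "t1": {"node1": None, "node2": None, "node3": None},
}
ground_truth = network_mean_hs(readings)
```

Time steps with no valid reading are dropped rather than filled in, so that training samples are not built from invented targets.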


Recent research by Kerkez et al. (2012) and Welch et al. (2013)
has developed new sensor placement strategies for monitoring
snow.  Although  these  methods  were  not  employed  in  the  experi¬ 
ments  discussed  in  this  paper,  they  should  be  considered  in  future 
applications  of  our  techniques. 

2.1. Suitability of machine learning

Snow  pillows  are  large,  expensive,  permanent  installations 
that  measure  SWE  at  a  single  location.  The  infrastructure  for  the 
automatic  collection  of  single-point  SWE  is  well  established.  For 
example,  there  are  830  Snowpack  Telemetry  (SNOTEL)  sites  in 
the  United  States  (Surveyor,  2014)  and  another  124  snow  pillows 
operated  by  the  California  Department  of  Water  Resources. 
However,  the  extrapolation  from  single-point  measurements  to 
surrounding  areas  is  error  prone.  The  spatial  distribution  of  alpine 
snow  cover  is  highly  variable  (Balk  and  Elder,  2000;  Elder  et  al., 
1991;  Jost  et  al.,  2007),  due  to  a  variety  of  environmental  forcing 
effects,  such  as  topography  (Anderton  et  al.,  2004),  canopy  cover 
(Moeser,  2010),  and  wind  and  solar  exposure  (Moeser,  2010; 
Moeser  et  al.,  2011). 

Meromy et al. (2013) studied 15 snow stations across the western
United States and found that snow station biases were frequently
greater than 10% of the surrounding mean observed
snow  depth.  The  flat-field  areas  where  snow  pillows  are  commonly 
located  are  usually  not  typical  of  more  complex  nearby  terrain, 
causing  the  majority  of  such  stations  to  overestimate  snow  depth 
in their vicinity (Grünewald et al., 2013). Molotch and Bales
(2005)  studied  the  areas  surrounding  six  SNOTEL  stations  in  the 
Rio  Grande  headwaters.  They  found  that  only  a  small  fraction  of 
grid  elements  were  representative  of  mean  grid  SWE  during  accu¬ 
mulation,  and  that  no  elements  were  representative  of  mean  grid 
SWE  during  both  accumulation  and  ablation.  SNOTEL  stations  in 
the  Rio  Grande  headwaters  preferentially  represent  densely  forest¬ 
ed  areas  and  experience  snow  cover  persistence  that  is  14%  greater 
than  the  mean  persistence  of  the  watershed  (Molotch  and  Bales, 
2006).  Rittger  (2012)  found  that  errors  based  on  statistical  rela¬ 
tionships  between  point  measurements  of  snow  and  streamflow 
in  the  Sierra  Nevada  can  reach  25-70%  in  one  out  of  five  years. 

The  relative  importance  of  separate  processes  which  govern 
snow  distribution  varies  over  the  course  of  a  snow  season.  Elder 
et  al.  (1991)  summarize  the  various  processes  and  explain  how 
their  influence  changes  over  time.  During  the  winter,  accumulation 
and  redistribution  processes  dominate.  Precipitation  is  determined 
by  regional  climate  and  latitude  as  well  as  by  local  orographic 
effects,  and  redistribution  by  wind,  avalanches,  and  sloughs  are 
the  primary  causes  of  spatial  heterogeneity.  In  the  spring,  however, 
snow  distribution  is  controlled  mainly  by  ablation.  Of  the  many 
energy  sources,  solar  and  longwave  radiation  dominate.  This  ener¬ 
gy  decreases  water  in  a  basin  through  sublimation  and  when  run¬ 
off  leaves  the  basin.  It  also  redistributes  SWE,  affecting  spatial 
variability.  These  dynamics  highlight  the  need  for  NRT  modeling 
of  snowpack,  as  the  forcing  effects  that  establish  snow  distribution 
vary  drastically  over  the  course  of  a  snow  season. 

However,  the  significant  consistency  of  snowpack  between  years 
encourages  investment  into  the  development  of  reusable  statistical 
models.  Strong  inter-annual  consistency  in  the  spatial  distribution 
of snow (Scipion et al., 2013), in SCA (Tappeiner et al., 2001), and in
the  snow  depth  patterns  of  maximum  accumulation  (Schirmer 
et  al.,  2011),  have  been  observed  in  the  Swiss  and  Italian  Alps.  In 
the  western  United  States,  consistent  wind  directions  can  produce 
stable  snow  accumulation  patterns  from  year-to-year  (Winstral 
and  Marks,  2014).  These  findings  suggest  a  strong  link  between 
accumulation  patterns  and  geophysical  terrain  and  indicate  that 
site-specific  snow  distribution  models  may  be  able  to  accurately 
characterize  snowpack  distribution  over  multiple  years. 


Approved  for  public  release;  distribution  is  unlimited. 



D. Buckingham et al./Journal of Hydrology 524 (2015) 311-325


Nevertheless,  long-term  changes  in  the  patterns  of  snow  distri¬ 
bution  may  be  caused  by  factors  such  as  changes  in  vegetation  or 
climate  change.  Therefore,  it  may  occasionally  be  necessary  to 
rerun  GP  and  generate  a  new  model.  Techniques  such  as  retroac¬ 
tive  SWE  calculation  (Rittger  et  al.,  2011)  could  be  used  to  detect 
when  previous  models  begin  to  perform  poorly,  indicating  that 
secular  variability  in  the  dynamics  of  snow  distribution  warrants 
the  development  of  a  new  model. 
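A check of this kind could be as simple as comparing recent model error against the error observed when the model was trained. The sketch below is one hypothetical rule, not a procedure used in this work: the function name, the window length, and the factor of two are all assumptions, and the reference values against which errors are computed are assumed to come from an independent source such as a retroactive SWE calculation.

```python
def needs_retraining(errors, baseline_mae, window=3, factor=2.0):
    """Flag drift when the mean absolute error over the last `window`
    evaluations exceeds `factor` times the error seen at training time.

    errors: chronological list of (prediction - reference) differences.
    """
    if len(errors) < window:
        return False  # not enough evidence yet
    recent = errors[-window:]
    recent_mae = sum(abs(e) for e in recent) / window
    return recent_mae > factor * baseline_mae

# Hypothetical error histories (cm) for a model with a training MAE of 1 cm.
stable = needs_retraining([1.0, -1.0, 1.5], baseline_mae=1.0)
drifting = needs_retraining([1.0, 3.0, -4.0, 5.0], baseline_mae=1.0)
```

Averaging over a window rather than reacting to a single poor prediction is a deliberate choice here, since one anomalous season should not by itself trigger a new GP run.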

It  may  be  desirable  to  produce  non-site-specific  models.  Trained 
at  catchments  where  ground  truth  data  is  available,  and  making 
use  of  predictor  variables  that  vary  between  catchments,  such  as 
topography,  such  models  could  then  be  applied  to  catchments 
where  no  independent  measurement  of  mean  catchment  SWE 
exists.  However,  we  did  not  incorporate  topography  because  the 
snow  pillows  are  all  on  flat  or  nearly  flat  ground.  Our  work  focuses 
on site-specific models and uses model inputs that vary over time at
a  given  catchment. 

2.2.  Why  GP? 

It  has  been  demonstrated  that  the  relationships  between  snow 
distribution  and  the  topographic  and  meteorological  forcing  effects 
include  nonlinearities  (Tappeiner  et  al.,  2001),  and  the  spatial  dis¬ 
tribution  of  SWE  is  nonlinear  because  it  is  influenced  simultane¬ 
ously  by  numerous  processes  including  accumulation,  ablation, 
and  snow  drifting  (Marofi  et  al.,  2011).  GP  can  produce  both  linear 
and  nonlinear  models.  If  the  data  used  to  train  GP  contain  only  lin¬ 
ear  relationships,  the  resulting  models  will  be  linear,  and  the  per¬ 
formance  of  GP  will  be  similar  to  that  of  LR. 

White-box  models,  such  as  those  produced  by  GP,  can  be  inter¬ 
preted  by  human  analysis,  potentially  yielding  new  information 
about  the  modeled  data  (Schmidt  et  al.,  2011).  Some  nonlinear 
regressors,  such  as  artificial  neural  networks,  produce  models  that 
are  difficult  or  impossible  to  interpret.  GP  trees,  however,  can  be 
expressed  as  mathematical  equations  (Fig.  2).  It  is  possible  that 
by  examining  these  equations  domain  experts  could  gain  novel 
insight  into  the  processes  governing  snow  distribution. 

Unlike  regression  techniques  that  constrain  the  form  of  the 
regressor,  GP  can  combine  operators,  variables,  and  constants  into 
arbitrary  arrangements.  GP  does  not  require  any  assumptions 
about  the  form  that  a  model  should  take:  it  is  left  open  to  inductive 
search.  By  generating  models  that  use  predictor  variables  in 
unexpected  ways,  GP  may  help  discover  previously  unknown 
relationships  among  variables. 
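The idea that a GP individual is a free-form arrangement of operators, variables, and constants that can still be read as an equation can be illustrated with a small expression-tree sketch. The nested-tuple representation and the operator set are assumptions chosen for brevity; the example tree encodes y = ((x - 0.026) * 1.204) + 0.388, one of the models listed in Fig. 4.

```python
import math

# Illustrative operator set; a real GP system would typically offer more.
BINARY = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
UNARY = {"sin": math.sin, "cos": math.cos}

def evaluate(tree, variables):
    """Evaluate a tree that is a number, a variable name, or (op, child[, child])."""
    if isinstance(tree, (int, float)):
        return float(tree)
    if isinstance(tree, str):
        return variables[tree]
    op, *children = tree
    args = [evaluate(child, variables) for child in children]
    return BINARY[op](*args) if op in BINARY else UNARY[op](*args)

def to_equation(tree):
    """Render the tree as a human-readable equation string."""
    if isinstance(tree, (int, float, str)):
        return str(tree)
    op, *children = tree
    parts = [to_equation(child) for child in children]
    if op in BINARY:
        return f"({parts[0]} {op} {parts[1]})"
    return f"{op}({parts[0]})"

def size(tree):
    """Number of nodes in the tree, a common measure of model complexity."""
    if isinstance(tree, tuple):
        return 1 + sum(size(child) for child in tree[1:])
    return 1

tree = ("+", ("*", ("-", "x", 0.026), 1.204), 0.388)
equation = to_equation(tree)
value = evaluate(tree, {"x": 0.026})
nodes = size(tree)
```

Variation operators in GP act on such trees by replacing or recombining subtrees, which is why no functional form has to be fixed in advance.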


Fig.  3.  Snowcloud  WSN  sensor  tower.  A  complete  sensor  stand  with  solar- 
recharged  battery  power,  wireless  mesh  communication,  and  multiple  sensor 
modalities.  October  2011,  Mammoth  Lake,  CA. 

Finally,  as  we  will  discuss  further,  GP  may  be  augmented  with 
multi-objective  optimization,  which  constrains  GP  to  produce  par¬ 
simonious  models.  This  mitigates  against  over-fitting,  a  significant 
concern  in  the  case  that  relatively  small  datasets  are  available  for 
machine  learning. 
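As a rough illustration of how a second objective can favor parsimonious models, the sketch below keeps only the candidates that are not dominated when both prediction error and model size are minimized. The choice of these two objectives, the function name `pareto_front`, and the candidate values are assumptions made for the example.

```python
def pareto_front(candidates):
    """Return the candidates that are not dominated on (error, size).

    candidates: list of (name, error, size) tuples; lower is better for both.
    A candidate is dominated if another one is no worse on both objectives
    and strictly better on at least one.
    """
    front = []
    for name, error, size in candidates:
        dominated = any(
            (e <= error and s <= size) and (e < error or s < size)
            for _, e, s in candidates
        )
        if not dominated:
            front.append((name, error, size))
    return front

# Hypothetical models with their validation error and number of tree nodes.
candidates = [
    ("linear", 4.0, 3),
    ("power", 2.5, 5),
    ("bloated", 2.5, 15),
    ("overfit", 2.4, 40),
]
front = pareto_front(candidates)
survivors = [name for name, _, _ in front]
```

The "bloated" model is discarded because a smaller model achieves the same error, while the very large model survives only on the strength of its marginally lower error, leaving the final choice along the front to the analyst.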

While  many  regression  techniques  possess  one  or  more  of  these 
desirable  qualities,  GP  possesses  all  of  them,  making  it  an  ideal 
candidate  for  snowpack  modeling. 

2.3.  The  primacy  of  snow  depth 

While SWE is a product of HS and density (ρ), it has been shown
that HS is the essential determining metric for SWE estimation.
Models have been developed to derive ρ estimates from HS measurements
(Logan, 1973; Sturm et al., 2010), and measurements
of HS are highly predictive of SWE (Adams, 1976). Analysis of the
spatial variability of HS and ρ has revealed that the variability of
HS is significantly greater than that of ρ (Lopez-Moreno et al.,
2012). Variation of SWE is therefore overwhelmingly a product of
HS variation (Moeser et al., 2011; Molotch et al., 2005; Sturm
et al., 2010; Elder et al., 1991, 1998). The effect of ρ variation on
SWE is small by comparison, and estimates of areal SWE derived
from one or several SWE measurements can be greatly improved
by incorporating a larger number of HS measurements (Elder
et al., 1998; Moeser et al., 2011), which are much less labor intensive
than manual SWE measurements (Sturm et al., 2010).
Snowcloud, which provides ground-truth data for Experiment Set II,
measures HS. Therefore, as has been done elsewhere, we use HS
as a “surrogate for SWE” (Winstral et al., 2002).

Fig. 2. These example GP trees were manually selected from the final populations of GP runs conducted for Experiment Set II. The leftmost tree represents a simple linear
model. The middle tree is a nonlinear model. The rightmost tree is a more complex nonlinear model.

Generation 0

$y = (\log(x) + 8.293)^{2}$
$y = \sin(x) + 0.388$
$y = (-x - 0.319)x$
$y = 1.303 \cdot x^{1.07}$

Generation 1

$y = \sin(x) + 0.388$
$y = \sin(x - 0.026) + 0.388$
$y = 1.303 \cdot x^{1.07}$
$y = 0.912 \cdot x^{1.07}$

Generation n

$y = \cos(x \cdot 1.309) - (x^{0.501})$
$y = ((x - 0.026) \cdot 1.204) + 0.388$
$y = (0.912 \cdot x^{1.81}) - 0.441$
$y = (7.337 \cdot x^{1.81}) - 8.139$

Fig. 4. Genetic programming algorithm. The figure on the left demonstrates the iterative process through which GP modifies a population of solutions. On the right, a
population of four models evolves as each iteration of the GP cycle produces a new generation.
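The relationship between SWE, HS, and density discussed in this section can be made concrete with a short calculation. Expressed as a depth of water, SWE equals HS multiplied by the ratio of snow density to water density. The depth and density values below are hypothetical and were chosen only to show how a wide range in HS moves SWE more than a comparatively narrow range in ρ.

```python
WATER_DENSITY = 1000.0  # kg/m^3

def swe_cm(hs_cm, snow_density):
    """SWE as a depth of water (cm) from snow depth (cm) and bulk density (kg/m^3)."""
    return hs_cm * snow_density / WATER_DENSITY

# Hypothetical spread in snow depth at a fixed density of 300 kg/m^3.
swe_by_depth = [swe_cm(hs, 300.0) for hs in (50.0, 100.0, 150.0)]

# Hypothetical spread in density at a fixed depth of 100 cm.
swe_by_density = [swe_cm(100.0, rho) for rho in (250.0, 300.0, 350.0)]

depth_range = max(swe_by_depth) - min(swe_by_depth)
density_range = max(swe_by_density) - min(swe_by_density)
```

In this invented case the spread in SWE attributable to HS is three times the spread attributable to ρ, which mirrors the qualitative argument for using HS as a surrogate.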

2.4.  Related  work 

Moeser  et  al.  (2011)  explored  three  models  for  estimating  SWE 
in  the  area  around  a  meteorological  station  using  ground  based 
measurements.  The  first  model  used  meteorological  data  such  as 
air  temperature  and  solar  radiation,  tree  canopy  cover  measure¬ 
ments,  and  HS  measurements  collected  by  the  Snowcloud  WSN, 
as  well  as  a  single-point  SWE  measurement.  The  second  model 
used  multiple  HS  measurements  and  single-point  SWE  measure¬ 
ments,  but  no  meteorological  or  tree  canopy  data.  The  third  model 
used  meteorological  and  tree  canopy  data,  along  with  multiple  HS 
measurements,  but  no  single-point  SWE  measurement.  It  was 
found  that  increasing  the  number  of  HS  measurements  can 
improve  areal  SWE  measurements  because  HS  varies  more  than 
snow density. While this work used linear modeling, our work
expands upon it by developing nonlinear models.

Marofi et al. (2011) compared three methods for modeling SWE:
multivariate  nonlinear  regression  (MNLR),  artificial  neural  net¬ 
works  (ANN),  and  a  neural  network-genetic  algorithm  (NNGA), 
where  genetic  algorithms  were  used  to  parameterize  ANNs  and 
the  learning  process.  ANN  performed  better  than  MNLR,  suggesting 
that  computational  intelligence  approaches  may  outperform  MNLR 
for  modeling  SWE.  NNGA  performed  better  than  ANN,  suggesting 
that  evolution-inspired  genetic  algorithms  can  be  used  to  develop 
effective  models  of  SWE.  Tabari  et  al.  (2010)  estimated  HS  and  SWE 
using  multiple  methods  and  also  found  that  NNGA  provided  the 
best  results.  Unlike  neural  networks,  GP  produces  white  box  models. 

Tappeiner  et  al.  (2001)  compared  the  performance  of  LR-based 
and  ANN-based  snowpack  models,  which  used  topographic  and 
meteorological data to estimate SWE. The authors compared the
results of LR with ANN to estimate the degree of necessary nonlinearity
in SWE modeling. The ANN performed significantly better
than  LR,  demonstrating  nonlinearity  in  the  relationships  between 
topographic  and  meteorological  variables  and  SWE. 

Several  studies  have  used  binary  regression  trees  to  model 
snowpack.  Winstral  et  al.  (2002)  derived  terrain-based  parameters 
from  digital  elevation  models  (DEM)  which  were  used  as  input 
variables  to  binary  regression  trees.  One  parameter  was  based  on 
maximum  upwind  slopes  relative  to  seasonally  averaged  winds. 
Another  measured  upwind  breaks  in  slope  from  a  given  location. 
Binary  tree  models  based  on  these  terrain-based  parameters  as 
well  as  elevation,  solar  radiation,  and  slope  performed  better  than 
models  based  only  on  elevation,  solar  radiation,  and  slope.  Elder 
et  al.  (1998)  modeled  the  distribution  of  SWE  by  merging  remotely 
sensed  snow-covered  area  data  with  binary  tree  models  applied  to 
field  measurements  of  HS  and  SWE.  Balk  and  Elder  (2000)  com¬ 
bined  binary  regression  trees  with  kriging  of  manual  snow  survey 
measurements  and  snow-covered  area  determined  by  aerial  pho¬ 
tographs,  to  estimate  SWE.  Anderton  et  al.  (2004)  used  binary 
regression  trees  to  relate  HS  and  disappearance  date  to  terrain 
indices.  They  found  that  the  topographic  effects  on  snow  redistri¬ 
bution  by  wind  primarily  determined  SWE  distribution  at  the  start 
of  the  melt  season  which,  more  than  melt  rates,  determined  the 
patterns  of  snow  disappearance.  Molotch  et  al.  (2005)  compared 
binary  regression  tree  models  using  various  sources  of  DEMs  and 
found  that  using  DEMs  from  different  sources  leads  to  significant 
differences  in  modeled  snowpack  distribution.  The  most  significant 
differences  were  on  ridge-tops,  where  the  elevation  values  differed 
across  DEMs. 

In  Experiment  Set  II  we  compare  the  performance  of  BT  to  GP. 
Unlike  this  previous  work  which  used  binary  regression  trees  to 
produce  spatially  distributed  models  of  snowpack,  our  models  pre¬ 
dict  a  single  value:  mean  HS  measured  by  a  wireless  sensor 
network. 

Marks  et  al.  (1999)  also  developed  spatially  distributed  models. 
They  used  topographic  data  to  determine  estimates  of  radiation, 
temperature,  humidity,  wind,  and  precipitation  for  use  in  a  coupled 
energy  and  mass-balance  model  called  ISNOBAL. 

Recent research has made significant advances in simulating
the effects of wind on snow distribution. Winstral et al. (2009)
developed a simplified wind model that uses upwind topography
to accurately predict wind speeds. Winstral et al. (2013) developed
a snow distribution algorithm that uses terrain structure, vegetation,
wind, and precipitation data to simulate wind-affected snow
accumulation. It accurately predicted disparate snow distribution
caused by inhomogeneous precipitation and redistribution by
wind. Winstral and Marks (2014) analyzed the effects of wind on
snow distribution. They found that high wind speeds increased
snow depth variability, that forested sites decreased variability
by moderating wind effects, and that consistent wind directions
produced accumulation patterns that were stable between years.

Table 1
CDEC snow course site descriptions.

ID   EL (m)  Name            Asp.   Exposure
CAP  2438    Caples Lake     SW     open meadow, low brush
GRZ  2103    Grizzly Ridge   N      meadow in scattered timber
KTL  2225    Kettle Rock     S      sloping, open meadow
MSH  2408    Mount Shasta    SE     grassy and rocky meadow
NTH  2835    North Lake      SE     grassy meadow
SPD  1585    Lake Spaulding  level  grassy meadow
HIG  1838    Highland Lakes  NW     medium sized meadow in dense timber
HYS  2012    Huysink         W      open meadow on one leg, opening in timber on second leg

Sturm  et  al.  (2010)  used  HS,  day  of  the  year,  and  climate  classes, 
such  as  Alpine,  Maritime,  and  Tundra,  to  estimate  snowpack  densi¬ 
ty.  Estimated  snowpack  density  was  used  to  convert  HS  measure¬ 
ments  into  SWE  estimates. 

Guan et al. (2010) found that atmospheric rivers (ARs) are associated
with intense storms that contribute a large percentage of
snow during most years. Because AR storms are relatively warm,
the partitioning of AR precipitation into snowfall versus rainfall
is sensitive to minor variation in surface air temperature.

Rittger  et  al.  (2011)  combined  satellite-based  measurements  of 
snow-covered  area  with  energy  balance  calculations  to  retroactive¬ 
ly  calculate  distributed  SWE  at  the  date  of  maximum  accumula¬ 
tion,  using  the  “reconstruction”  technique  originally  developed 
by  Martinec  and  Rango  (1981).  This  calculation  was  then  used  to 
evaluate  the  accuracy  of  two  real-time  models.  They  found  that 
at  elevations  below  1500  m,  the  real-time  models  overestimated 
SWE  because  of  early  season  melt,  and  at  elevations  above 
3000  m,  the  real-time  models  underestimated  SWE  because  they 
do  not  sample  these  higher  elevations.  It  is  possible  that  this  tech¬ 
nique  could  be  used  to  evaluate  the  effectiveness  of  the  inductive 
learning  methods  that  we  describe  in  this  work. 

3.  Training  data  and  model  inputs 

Inductive  machine  learning  requires  substantial  datasets  for 
developing  and  evaluating  models,  and  we  acquired  extensive 
hydrological  and  meteorological  data  for  use  in  our  experiments. 
We  focused  on  two  types  of  available  datasets  that  are  approxima¬ 
tions  of  mean  catchment  SWE.  First,  we  consider  a  record  of  CDEC 
snow  courses  from  the  Sierra  Nevada.  We  observe  that  CDEC  snow 
courses  are  intended  to  provide  an  estimation  of  SWE  at  a  par¬ 
ticular  elevation  (USDA,  2014),  though  in  fact  they  are  linear  tran¬ 
sects  of  SWE  samples.  Second,  we  consider  a  record  of  Snowcloud 
sensor  network  readings  from  Norway  and  California.  Snowcloud 
provides  distributed  coverage  of  snow  depth  readings  for  the 
deployment  area,  as  well  as  fine  time  granularity,  and  can  support 
better  estimations  of  mean  catchment  SWE  than  periodic  snow 
courses. 

3.1.  Experiment  Set  I  data 

Experiment  Set  I  used  data  collected  from  eight  sites  across 
California.  There  were  three  main  types  of  data:  SWE  from  manual 
snow  courses,  SWE  measurements  from  snow  pillows,  and  air  tem¬ 
perature  data. 


The  California  Data  Exchange  Center  (CDEC)  provided  an  exten¬ 
sive  database  of  snow  data.  The  snow  courses  that  we  used,  which 
are  described  in  Table  1,  were  performed  monthly,  were  about  200 
meters  long,  and  consisted  of  10  measurements,  the  mean  of  which 
was  recorded.  CDEC  also  maintains  single-point  SWE  measurement 
data  from  snow  pillows  at  sites  throughout  California.  Of  the  404 
snow  course  sites,  59  are  co-located  with  snow  pillows. 

The National Climatic Data Center (NCDC) maintains meteorological
data, such as air temperature, wind speed, and solar radiation
measurements, collected at weather stations across the United
States.  We  used  data  from  the  four  NCDC  stations  which  are  locat¬ 
ed  within  30  km  of  CDEC  snow  courses.  We  arbitrarily  chose  a 
30  km  cutoff  because  we  suspected  that  meteorological  activity 
within  that  distance  might  be  predictive  of  measurements  at  the 
snow  course.  The  models  generated  by  machine  learning  will  not 
make  significant  use  of  input  data  that  is  not  predictive. 

Significant  gaps  exist  in  the  NCDC  database,  and  of  the  various 
sensor  modalities,  air  temperature  data  is  the  most  complete. 
Using  more  meteorological  inputs  and  necessarily  fewer  data  sam¬ 
ples,  we  had  previously  been  unable  to  generate  effective  models 
of  SWE.  For  Experiment  Set  I,  therefore,  air  temperature  was  the 
only  meteorological  input.  Air  temperature  is  known  to  be  a  highly 
effective  predictor  of  melt  rate  because  it  is  correlated  with  long¬ 
wave  atmospheric  radiation,  the  most  important  energy  source 
for  snowmelt  (Ohmura,  2001).  Air  temperature  is  made  accessible 
to  the  models  by  three  variables:  minTemp7,  maxTemp7,  and 
meanTemp7,  which  aggregate  daily  values  over  the  seven  days 
inclusively  preceding  the  day  for  which  SWE  is  estimated. 
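
As an illustration of how three such variables could be derived, the following minimal Python sketch aggregates one daily temperature reading per day over the seven days ending on, and including, the estimation day. The function name, the dictionary layout, the choice to aggregate a single daily value, and the handling of missing days are assumptions made for the example; the text does not specify them.

```python
from datetime import date, timedelta
from statistics import mean

def temp7_features(daily_temp, day, window=7):
    """Aggregate daily air temperatures over the `window` days ending on `day`.

    daily_temp maps a date to one daily temperature reading (assumed layout).
    Returns None if any day in the window is missing (an assumed policy for
    gaps in the temperature record).
    """
    days = [day - timedelta(days=k) for k in range(window)]
    if any(d not in daily_temp for d in days):
        return None
    values = [daily_temp[d] for d in days]
    return {
        "minTemp7": min(values),
        "maxTemp7": max(values),
        "meanTemp7": mean(values),
    }

# Hypothetical readings (degrees C) for 1-7 March 2010.
readings = {date(2010, 3, d): float(d) - 4.0 for d in range(1, 8)}
features = temp7_features(readings, date(2010, 3, 7))
```

Returning None for an incomplete window is one simple way to keep only samples for which all inputs are available.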

We  used  the  temporal  and  spatial  intersection  of  available  data 
from  these  three  sources  (CDEC  snow  courses,  CDEC  snow  pillows, 
NCDC  air  temperature  data)  to  construct  eight  datasets,  based  on 
eight  snow  course  sites.  These  snow  courses  were  selected  because 
they  are  coincident  with  either  snow  pillow  data,  NCDC  air  tem¬ 
perature  data,  or  both,  over  a  range  of  time  that  includes  a  large 
number of sample points (greater than 100 except for one site).
The  constructed  datasets  are  summarized  in  Table  2. 

Table 2
Experiment Set I data summary by CDEC site.

ID   Pillow  NCDC base          Dist. (mi)  Samples  Years
CAP  YES     N/A                N/A         177      1970-2011
GRZ  YES     N/A                N/A         207      1970-2011
KTL  YES     N/A                N/A         159      1979-2011
MSH  NO      Mount Shasta       5.98        137      1973-2011
NTH  NO      Bishop Airport     18.27       147      1973-2011
SPD  NO      Blue Canyon Nyack  4.56        174      1977-2011
HIG  YES     Mount Shasta       18.31       75       1980-2012
HYS  YES     Blue Canyon Nyack  9.79        111      1984-2011

3.2. Experiment Set II data

Experiment Set II used HS data collected by four Snowcloud
sensor nodes in Sulitjelma, Norway between January and April,
2013. Each node sampled HS every six hours. We averaged HS
measurements from the four nodes (Table 3) and then over each
day to produce 93 estimates of mean catchment HS. These values
served as ground-truth HS for experiments at Sulitjelma.

Table 3
Snowcloud deployment coordinates.

Sulitjelma, Norway           Sagehen, CA
Tower  Lat.     Long.        Tower  Lat.      Long.
1      67.0981  16.0488      1      39.43161  -120.23975
2      67.0983  16.0497      2      39.43155  -120.23936
3      67.0983  16.0482      3      39.43140  -120.23976
4      67.0987  16.0487      4      39.43173  -120.23882
                             5      39.43173  -120.23864
                             6      39.43204  -120.23872

Approximately  16  km  away  from  the  Sulitjelma  Snowcloud 
deployment  site  is  Storstilla  nedanfor  Balvatn  in  Nordland 
County,  station  number  164.12.0  (Balvatn).  The  Balvatn  station 
records  both  HS  and  SWE.  Daily  HS  measurements  collected  at 
Balvatn  compose  the  HS  input  variable  to  models  developed  for 
Sulitjelma in Experiment Set II.

Six  Snowcloud  wireless  sensor  network  sensor  nodes  were 
deployed  within  the  Sagehen  Creek  Field  Station,  near  Truckee, 
California,  from  January  to  May,  2010.  Each  node  reported  daily 
HS measurements, which we averaged to generate 99 estimates
of mean catchment HS. These values served as ground-truth HS
for  experiments  at  Sagehen.  Note  that  the  same  WSN  data  were 
used  by  Moeser  (2010). 

In  order  to  assess  the  significance  of  the  source  of  single-point  HS 
input  variables,  we  developed  models  for  estimating  mean  HS  at  the 
Sagehen  Snowcloud  deployment  using  inputs  from  two  different 
CDEC sites, Independence Camp (IDC) and Huysink (HYS). IDC is
approximately 5.5 km away from the Snowcloud deployment and,
like Sagehen, is on the eastern side of the Sierra crest. HYS is
approximately 30 km away, on the western side of the crest.

3.3.  Time  of  year 

Because  the  dynamics  underlying  snowpack  distribution  vary 
over  the  course  of  a  snow  season,  for  example  between  periods 
dominated  by  deposition  and  periods  dominated  by  ablation,  we 
introduce  time  of  year  (TOY)  as  an  independent  variable  for  both 
experiment  sets.  This  allows  models  to  distinguish  parts  of  the 
snow  season.  Time  of  year  is  an  integer  value  expressing  the  num¬ 
ber  of  days  since  January  1. 
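
A minimal Python sketch of this variable follows. Whether January 1 itself maps to 0 or 1 is not stated in the text; the sketch assumes 0, and the function name is hypothetical.

```python
from datetime import date

def time_of_year(day):
    """Days elapsed since January 1 of the same year (January 1 -> 0, assumed)."""
    return (day - date(day.year, 1, 1)).days

toy_april_first = time_of_year(date(2013, 4, 1))
```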

3.4.  Preparation  of  datasets 

We define a dataset, D, for each experiment (each row of Table 6
and each location in each row of Table 5). Elements of a dataset D
take the form of a 3-tuple, $(T, \theta, p)$, where T, time, specifies a
calendar date, $\theta$ is an estimate of the true value of the dependent
variable, and p is a vector of predictor variables. Although T is used to
generate predictor variables such as TOY and air temperature
statistics, it is not itself a predictor variable and is therefore not
included in p. T is unique in D so that no two data samples in D
have the same T:

$\forall\, (T_1, \theta_1, p_1), (T_1, \theta_2, p_2) \in D:\ \theta_1 = \theta_2 \text{ and } p_1 = p_2$  (1)

In Experiment Set I, $\theta$ is an approximation of mean catchment SWE
derived by manual snow course. In Experiment Set II, $\theta$ is an
approximation of mean catchment HS derived from Snowcloud
WSN measurements.

Depending on the experiment, p includes some combination of
HS measured at a snow pillow, SWE measured at a snow pillow,
TOY (an integer value derived from T), and air temperature (which
is composed of three variables: minTemp7, maxTemp7, and
meanTemp7). The Model inputs columns of Table 5 and Table 6
specify the contents of p for each experiment.

In order that a model developed from D may be evaluated on
new, unseen data, D is divided into training, q, and testing, t,
subsets. The training set is twice as large as the testing set. However,
GP and BT require that q be further divided into grow, g, and
selection, s, subsets:

$q = g \cup s$ and $g \cap s = \emptyset$ and $|g| = |s|$  (2)

In all experiments, D is first divided into g, s, and t:

$D = g \cup s \cup t$ and $g \cap s \cap t = \emptyset$ and $|g| = |s| = |t|$  (3)

For BM and LR, g and s are simply combined into q and used as
training data. As discussed in more detail in Section 4, in the case of
GP and BT g is used to generate a set of models and s is used to
determine which one should be kept and evaluated on t. In any
case, q is used to obtain a single model, which is then exposed to
t to evaluate its ability to predict unseen data.

We explored several methods for dividing D into g, s, and t.
In Experiment Set I and in the first part of Experiment Set II
(Experiment Set II: Random Division), the chronologically
ordered D is randomly shuffled and then divided into thirds,
as illustrated by Fig. 5a. This method has the effect that a large
portion of the training data is likely to be temporally proximal
to testing data.

As discussed further in Section 5, we found in Experiment Set II
that the temporal proximity between q and t caused machine
learning to map TOY values to estimates of HS. The models
memorized the data rather than capturing the relationships among the
data. We therefore conducted Experiment Set II: 4 Bins. Instead
of shuffling D, we maintained its ordering and divided it into four
chronologically contiguous bins. Each bin is then subdivided into
three chronologically contiguous subsets which are assigned to
g, s, and t. This method is illustrated by Fig. 5b. We also conducted
Experiment Set II: 3 Bins and Experiment Set II: 2 Bins, as illustrated
in Fig. 5c and d. As we move from Experiment Set II: Random
Division to Experiment Set II: 2 Bins, the division of D transitions
from finer to coarser temporal granularity. As this granularity
becomes coarser, it becomes more difficult for machine learning
to use TOY to simply memorize data. However, it also becomes
more difficult for models to capture the variation of the dynamics
of snowpack distribution over the course of a snow season.

In order to introduce stochasticity into the division of D and thus
allow the repetition of experiments to produce a distributed sample
of results, a randomly generated offset shifts the starting point
of the division. Fig. 5e illustrates the effect of this offset in the case
of three bins.

Fig. 5. Techniques for dividing a chronologically ordered dataset into g, s, and t (white, light gray, and dark gray respectively). (a) Random division: dataset is randomly divided into three subsets of equal size. (b) Four bins: dataset is divided into four temporally contiguous bins, which are each divided into three subsets. (c) Three bins: dataset is divided into three temporally contiguous bins, which are each divided into three subsets. (d) Two bins: dataset is divided into two temporally contiguous bins, which are each divided into three subsets. (e) Three-bin case illustrating random offset.
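
The two families of division described in this section can be sketched in Python as follows. This is a minimal illustration, not the code used in the experiments: the function names are hypothetical, and applying the random offset as a wrap-around rotation of the chronologically ordered samples is an assumption about how the starting point is shifted.

```python
import random

def random_division(samples, rng):
    """Shuffle the samples and cut them into three near-equal parts g, s, t."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    third = len(shuffled) // 3
    return shuffled[:third], shuffled[third:2 * third], shuffled[2 * third:]

def binned_division(samples, n_bins, offset=0):
    """Split chronologically ordered samples into n_bins contiguous bins and
    cut each bin into contiguous thirds assigned to g, s, and t in that order.

    `offset` rotates the starting point of the division; the wrap-around
    used here is an assumption, not a detail given in the text.
    """
    n = len(samples)
    rotated = [samples[(i + offset) % n] for i in range(n)]
    g, s, t = [], [], []
    for b in range(n_bins):
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins
        bin_ = rotated[lo:hi]
        third = len(bin_) // 3
        g += bin_[:third]
        s += bin_[third:2 * third]
        t += bin_[2 * third:]
    return g, s, t

days = list(range(24))  # stand-in for 24 chronologically ordered samples
g, s, t = binned_division(days, n_bins=4)
g_rand, s_rand, t_rand = random_division(days, random.Random(0))
```

With fewer bins, each contiguous run of test samples lies farther in time from the nearest training sample, which is the property the binned experiments rely on.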

4.  Calculation 

In  this  section  we  first  describe  how  we  compared  the  perfor¬ 
mance  of  different  snowpack  modeling  techniques.  We  then 
describe  the  various  modeling  techniques  that  we  used,  with  spe¬ 
cial  emphasis  on  GP. 

4.1.  Comparing  estimation  methods 

In order to compare the performance of two machine learning
techniques, M and M', on a dataset D, D is divided into complementary
subsets q and t. Methods M and M' are applied to q to produce
estimators $\hat{\theta}$ and $\hat{\theta}'$. This process may be deterministic or
nondeterministic. In Experiment Set I and Experiment Set II: Random
Division, nondeterminism is introduced by the random division of
D. GP introduces further nondeterminism by the stochasticity of
the GP algorithm. The BT algorithm is deterministic when a single
input variable is used, but nondeterministic when applied to multiple
input variables. Estimators $\hat{\theta}$ and $\hat{\theta}'$ are applied to t to
determine the mean absolute errors of the estimators, $\mathrm{MAE}(\hat{\theta})$ and
$\mathrm{MAE}(\hat{\theta}')$, as we will discuss in Section 4.2.

This process of randomly dividing D and applying M and M' to
obtain $\mathrm{MAE}(\hat{\theta})$ and $\mathrm{MAE}(\hat{\theta}')$ is repeated 30 times, resulting in
vectors of estimator errors $e_M$ and $e_{M'}$, each with cardinality 30. We
consider $e_M$ and $e_{M'}$ to be statistical samples of errors drawn from
the population of errors that methods M and M' could produce given
D. We chose to collect 30 samples because a sample size of at least
30 allows the Central Limit Theorem to be safely applied without
assuming a normal population distribution, permitting the application
of the one-sample t-test to calculate confidence intervals and
the paired two-sample t-test to test hypotheses.

The means of $e_M$ and $e_{M'}$ are unbiased estimates of the true
population means $\mu_M$ and $\mu_{M'}$. To find out if M' outperforms M on
dataset D we pose the hypotheses:

$H_0: \mu_{M'} = \mu_M$  (null hypothesis)

$H_a: \mu_{M'} < \mu_M$  (alternative hypothesis)

and apply Student's t-test for paired samples to $e_M$ and $e_{M'}$. If the
null hypothesis is rejected, we say that method M' produces lower
error (performs better) on dataset D than does M. We report the
p-value, the probability of obtaining a difference at least this extreme
if the null hypothesis were true.
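
The comparison can be sketched with the Python standard library alone. The sketch below computes the paired t statistic and compares it with the one-sided critical value for 29 degrees of freedom at the 0.05 level (1.699); it does not compute a p-value, which would need the t distribution's cumulative distribution function from a statistics package. The error samples are hypothetical values made up for the example, not results from the experiments.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(errors_m, errors_m_prime):
    """t statistic for the paired differences e_M - e_M'.

    A large positive value supports the alternative that M' has lower mean error.
    """
    diffs = [a - b for a, b in zip(errors_m, errors_m_prime)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))

T_CRIT_29_DF = 1.699  # one-sided critical value, alpha = 0.05, 29 degrees of freedom

# Hypothetical MAE samples from 30 repetitions (illustrative values only).
e_m = [10.0 + 0.1 * (i % 5) for i in range(30)]
e_m_prime = [9.0 + 0.1 * (i % 3) for i in range(30)]
t_stat = paired_t_statistic(e_m, e_m_prime)
reject_null = t_stat > T_CRIT_29_DF
```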

4.2.  Evaluating  estimator  error 

Recall that an element d of dataset D takes the form $(T, \theta, p)$ and
that D has been divided into q and t. An estimation method M is
applied to $q \subset D$ to generate an estimator $\hat{\theta}$, which is a function
from predictor variables p to dependent variable y, an estimate
of $\theta$.

$\hat{\theta} : p \to y, \quad y \approx \theta$

The error of $\hat{\theta}$ on an input vector is the difference between the
estimate it produces and ground truth.

$E_{\hat{\theta}}(p) = \hat{\theta}(p) - \theta$  (4)

The error is calculated on each sample in t to determine the mean
absolute error of the estimator:

$\mathrm{MAE}(\hat{\theta}) = \frac{1}{k} \sum_{i=1}^{k} \left| E_{\hat{\theta}}(p_i) \right|$  (5)

where

$t = (d_1, \ldots, d_k)$ and $p_i \in d_i \in t \subset D$

4.3.  Basic  method 

The basic method (BM) assumes that SWE as measured at a snow
pillow is representative of catchment-wide SWE. It naively estimates
ground truth (snow course-derived) SWE to be the same
as the independent variable (snow pillow-derived) SWE measurement.
Error in the predictive power of BM expresses the difference
between snow pillow measurements and snow course SWE
measurements. If x represents SWE measured at the snow pillow, then

$x \in p$ and $\hat{\theta}(p) = x$  (6)

Unlike  the  more  sophisticated  machine  learning  techniques,  BM 
does  not  make  use  of  training  data  to  generate  a  model. 
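
As a small worked illustration of Eqs. (4)-(6), the following Python sketch evaluates the basic method on a hypothetical test set. The tuple layout mirrors the form of a dataset element described above, with p as a dictionary; the key name pillow_swe and all numbers are assumptions made for the example.

```python
def mean_absolute_error(estimator, test_set):
    """MAE of `estimator` over test samples given as (T, theta, p) tuples."""
    errors = [estimator(p) - theta for _, theta, p in test_set]
    return sum(abs(e) for e in errors) / len(errors)

def basic_method(p):
    """Basic method: return pillow SWE unchanged as the estimate."""
    return p["pillow_swe"]

# Hypothetical test set: theta is snow course SWE (mm), p holds pillow SWE (mm).
test_set = [
    ("2011-02-01", 500.0, {"pillow_swe": 450.0}),
    ("2011-03-01", 700.0, {"pillow_swe": 760.0}),
    ("2011-04-01", 650.0, {"pillow_swe": 650.0}),
]
mae_bm = mean_absolute_error(basic_method, test_set)
```

The three errors are -50, 60, and 0 mm, so the mean absolute error is 110/3, or about 36.7 mm.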

4.4.  Linear  regression 

Linear regression (LR) fits a least-squares linear model to training
data which is then evaluated on test data (Hastie et al.,
2009). LR expresses the linear relationships between independent
and dependent variables. We used the gsl_multifit_linear function
from the GNU Scientific Library (GSL, 2014) to perform LR. We
include LR in order to gain insight into the data we are using. LR
will perform less well than nonlinear techniques only if the modeled
data contain nonlinear relationships.
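
For readers who want to see the least-squares idea in isolation, here is a minimal Python sketch for a single predictor using the closed-form solution. It is not the GSL routine used in this work, which handles multiple predictors; the variable names and data are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x for a single predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical training pairs: pillow SWE -> snow course SWE (mm).
pillow = [100.0, 200.0, 300.0, 400.0]
course = [130.0, 210.0, 290.0, 370.0]
a, b = fit_line(pillow, course)

def predict(x):
    """Apply the fitted line to a new pillow SWE value."""
    return a + b * x
```

Because the example data lie exactly on a line, the fit recovers an intercept of 50 and a slope of 0.8.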

4.5.  Genetic  programming 

GP  is  an  evolutionary  algorithm,  inspired  by  biological  evolu¬ 
tion,  that  iteratively  evolves  populations  of  parse  trees  to  perform 
symbolic  regression  (Koza,  1992)  (see  Fig.  4).  In  this  work,  the  trees 
are  snowpack  models,  estimator  functions,  that  use  available  inde¬ 
pendent  variables  to  estimate  mean  SWE  (Experiment  Set  I)  or  HS 
(Experiment  Set  II)  at  the  catchment  scale.  Tree  terminals  are  input 
variables  and  constants,  while  internal  nodes  are  arithmetic  opera¬ 
tors.  The  operators  we  used  are  listed  in  Table  4. 


Table 4
GP parameters.

Parameter              Value
population size        1000 (Experiment Set I), 2000 (Set II)
number of generations  3000 (Experiment Set I), 10,000 (Set II)
max tree size          30
mutation operators     crossover (60%), mutation (40%)
binary operators       addition, subtraction, multiplication, division, power
unary operators        log, exponential, sine, cosine
terminals              independent variables, constant values

We used the lil-gp Genetic Programming System (System,
2013),  an  open  source  implementation  of  GP,  in  order  that  we 
might  make  any  needed  modifications.  We  modified  lil-gp  to 
implement  multi-objective  Pareto  optimization. 

GP  begins  by  generating  a  starting  population  of  randomly  con¬ 
structed  trees.  Each  tree  in  the  population  is  evaluated  on  training 
data  to  determine  its  fitness,  defined  as  the  inverse  of  mean  error. 
Trees  are  selected  according  to  their  size  and  fitness  to  produce  the 
population  for  the  next  generation.  Genetic  operators  make 
stochastic  modifications  to  the  new  trees,  randomly  perturbing 
their  fitness  values.  The  genetic  operators  we  used  were  mutation 
and  crossover.  Mutation,  which  is  applied  to  40%  of  new  trees, 
selects a subtree at random and replaces it with a new, randomly
generated  subtree.  In  crossover,  which  is  applied  instead  of  muta¬ 
tion  60%  of  the  time,  two  parent  trees  exchange  subtrees,  resulting 
in  two  novel  offspring.  Crossover  allows  recombination  of  subtrees 
from  existing  models  while  mutation  introduces  new  subtrees  to 
the  population,  maintaining  genetic  diversity.  Because  it  is  likely 
that  subtrees  taken  from  existing,  partially  evolved  models  will 
be  more  useful  than  new,  randomly  generated  subtrees,  crossover 
is  applied  more  frequently  than  mutation.  This  process  is  iterated 
over  many  evolutionary  generations,  each  time  replacing  the 
population  with  a  new  population  of  altered  trees.  Over  time,  this 
produces  populations  of  increasing  fitness. 
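
The two genetic operators can be illustrated with a simplified Python sketch in which an expression tree is a nested tuple (operator, left, right). This is not lil-gp: the representation, the reduced operator set, the terminal names, and the helper functions are all assumptions made for the example, and the sketch omits fitness evaluation, selection, and the tree size limit.

```python
import random

BINARY_OPS = ["+", "-", "*"]            # subset of the binary operators in Table 4
TERMINALS = ["pillow_swe", "toy", 1.0]  # hypothetical input names and a constant

def random_tree(rng, depth=2):
    """Grow a random expression tree as a nested tuple (op, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    return (rng.choice(BINARY_OPS),
            random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def paths(tree, prefix=()):
    """List the path (sequence of child indices) to every node in the tree."""
    found = [prefix]
    if isinstance(tree, tuple):
        found += paths(tree[1], prefix + (1,)) + paths(tree[2], prefix + (2,))
    return found

def get_subtree(tree, path):
    """Follow `path` from the root and return the subtree found there."""
    for index in path:
        tree = tree[index]
    return tree

def replace_subtree(tree, path, new):
    """Return a copy of `tree` with the node at `path` replaced by `new`."""
    if not path:
        return new
    children = list(tree)
    children[path[0]] = replace_subtree(tree[path[0]], path[1:], new)
    return tuple(children)

def mutate(tree, rng):
    """Subtree mutation: swap a random subtree for a freshly generated one."""
    return replace_subtree(tree, rng.choice(paths(tree)), random_tree(rng))

def crossover(a, b, rng):
    """Subtree crossover: each parent receives one random subtree of the other."""
    pa, pb = rng.choice(paths(a)), rng.choice(paths(b))
    return (replace_subtree(a, pa, get_subtree(b, pb)),
            replace_subtree(b, pb, get_subtree(a, pa)))

def size(tree):
    """Number of nodes in the tree."""
    return len(paths(tree))

rng = random.Random(1)
parent_a, parent_b = random_tree(rng, 3), random_tree(rng, 3)
# Apply crossover 60% of the time and mutation otherwise, as in Table 4.
if rng.random() < 0.6:
    child_a, child_b = crossover(parent_a, parent_b, rng)
else:
    child_a, child_b = mutate(parent_a, rng), mutate(parent_b, rng)
```

Crossover only rearranges existing material, so the combined size of the two offspring equals that of the two parents, whereas mutation can introduce subtrees that appear nowhere else in the population.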

The average wall-clock time for one experiment using the
Vermont Advanced Computing Core (VACC) supercomputer was
333 s for Experiment Set I (3000 generations) and 1207 s for
Experiment Set II (10,000 generations). The total wall-clock time
for all of Experiment Set I was approximately 89 h. The total
wall-clock time for all of Experiment Set II was approximately
321 h. Because GP is a stochastic optimization method, its
computational complexity is unclear. However, recent work has begun to
address this problem (Neumann et al., 2011; Durrett et al., 2010).

One  challenge  facing  GP,  like  all  techniques  for  deriving  a  model 
from  training  data,  is  over-fitting.  An  over-fit  model  performs  well 
on  training  data  but  does  not  generalize  well  and  fails  on  unseen 
data.  It  memorizes  values  instead  of  capturing  the  mathematical 
relationships  among  the  data. 

The  size  of  a  GP  model  (number  of  nodes  in  a  tree)  constrains  its 
complexity  and  fitness.  Trees  that  are  too  small  are  too  simple  to 
accurately  model  the  data  and  are  under-fit.  They  perform  poorly 
on  both  training  and  testing  data.  Trees  that  become  too  large  per¬ 
form  extremely  well  on  training  data  but,  due  to  over-fitting,  per¬ 
form  poorly  on  unseen  data.  Somewhere  between  these  extremes 
lies  the  best,  non-over-fit  model. 

In  order  to  explore  the  gradient  from  small,  under-fit  models  to 
large,  over-fit  models,  we  added  multi-objective  Pareto  optimiza¬ 
tion  to  lil-gp.  Pareto  optimization  applies  evolutionary  pressure 
toward  multiple  simultaneous  goals,  in  this  case  low  error  and 
small  model  size,  by  producing  a  population  (front)  of  non- 
dominated  models.  A  tree  is  dominated  by  another  tree  if  it  is  infe¬ 
rior  by  all  objectives,  i.e.  it  is  both  larger  and  has  lower  fitness.  A 
Pareto  front  (non-dominated  front)  consists  of  a  set  of  trees  such 
that  no  tree  is  dominated  by  any  other  tree  on  the  front.  The 
non-dominated  trees  are  selected  at  each  GP  generation  so  that 
each  population  is  a  non-dominated  front,  including  the  final 
population.  The  result  of  GP  is  therefore  a  set  of  trees  of  various 
sizes.  We  set  an  absolute  upper  bound  at  size  30  because  we  had 
observed  that  models  with  size  larger  than  30  were  consistently 
over-fit.  Arranged  from  smallest  to  largest,  the  error  of  these  trees 
on  the  training  data  decreases  monotonically.  Error  on  unseen 
data,  however,  will  decrease  only  to  a  point,  and  will  then  increase 
beyond  some  tree  size  as  the  models  become  over-fitted. 
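
Extracting a non-dominated front from a population can be sketched in Python as follows. The sketch uses the common Pareto definition (no worse on both objectives and strictly better on at least one), which is slightly broader than the wording above; the dictionary layout and the numbers are hypothetical.

```python
def dominates(a, b):
    """True if model a is no worse than b on both objectives (size, error)
    and strictly better on at least one of them."""
    return (a["size"] <= b["size"] and a["error"] <= b["error"]
            and (a["size"] < b["size"] or a["error"] < b["error"]))

def pareto_front(models):
    """Keep the models that no other model dominates, smallest first."""
    front = [m for m in models
             if not any(dominates(other, m) for other in models)]
    return sorted(front, key=lambda m: m["size"])

# Hypothetical population: tree size and training error of each model.
population = [
    {"size": 3, "error": 120.0},
    {"size": 7, "error": 80.0},
    {"size": 7, "error": 95.0},   # dominated by the other size-7 model
    {"size": 15, "error": 60.0},
    {"size": 20, "error": 75.0},  # dominated: larger and worse than size 15
]
front = pareto_front(population)
```

Sorted by size, the surviving models have strictly decreasing training error, matching the monotonic behaviour described above.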

This point marks the tree size that will maximize performance on
q without over-fitting. Models no bigger than this can express
features common to both training and testing data but cannot
express features that are unique to the training data. However, this
size threshold is not known while generating models because test
data are not available; they must remain unseen for model testing. We
therefore developed a novel selection set method for selecting a single
model from the Pareto front. In the selection set method, the
training data are further divided into two subsets of equal size, a
growth set, g, and a selection set, s (Eq. 2). GP is applied to g to
obtain a Pareto front. Each model on the front is then evaluated
on s. GP returns the model that performs best (lowest error) on s.
We used the selection set method in all experiments.
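
The final selection step can be sketched in a few lines of Python. The three models below are hypothetical stand-ins for an under-fit, a compact, and an over-fit member of a front, and the selection set values are made up for the example.

```python
def mae(model, samples):
    """Mean absolute error of a callable model on (theta, p) pairs."""
    return sum(abs(model(p) - theta) for theta, p in samples) / len(samples)

def select_model(front, selection_set, error_fn):
    """Return the model on the front with the lowest error on the selection set."""
    return min(front, key=lambda model: error_fn(model, selection_set))

# Hypothetical front, ordered from smallest to largest model.
front = [
    lambda p: 500.0,                       # constant: under-fit
    lambda p: 1.1 * p,                     # compact model
    lambda p: 1.1 * p + 40.0 * (p > 610),  # extra term fitted to noise in g
]
# Selection set of (theta, p) pairs not used while growing the models.
selection_set = [(550.0, 500.0), (660.0, 600.0), (770.0, 700.0)]
best = select_model(front, selection_set, mae)
```

Here the largest model is penalized on s because its extra term does not generalize, so the compact model is returned.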

4.6. Binary regression trees

We  include  BT  in  Experiment  Set  II  in  order  to  compare  GP  to 
another  nonlinear,  less  computationally  demanding,  modeling 
technique.  Erxleben  et  al.  (2002)  compared  the  performances  of 
four  spatial  interpolation  methods  to  estimate  SWE  and  found  that 
a  method  combining  binary  regression  trees  with  geostatistical 
methods  was  more  accurate  than  other  methods.  We  used  the 
DecisionTreeRegressor  class  of  the  Scikit-learn  machine  learning 
module  for  Python  (Pedregosa  et  al.,  2011).  This  software  imple¬ 
ments  the  Classification  and  Regression  Trees  (CART)  algorithm, 
which  is  similar  to  C4.5  (Hastie  et  al.,  2009).  BT  is  parameterized 
by  the  maximum  tree  depth;  we  used  default  options  for  other 
parameters.  As  with  GP,  the  data  for  BT  was  divided  into  g,  s,  and 
t.  For  each  experiment,  a  set  of  trees  was  trained  on  g  such  that 
the  nth  tree  had  a  maximum  depth  of  n.  The  maximum  value  of 
n  was  determined  by  incrementing  n  until  further  increase  did 
not  result  in  larger  trees.  The  maximum  value  of  n  varied  between 
7  and  13. 

Like the Pareto front produced by GP with multi-objective optimization,
this method results in a gradient of models ranging
from very small models with high error on g to very large models
with low error on g. Each is evaluated on s and the one with the
lowest error is returned by BT to be evaluated on t in order to
determine  model  error.  Thus,  we  applied  the  same  selection  set 
method  to  BT  as  to  GP  in  order  to  discourage  over-fitting  and  to 
provide  similar  exposure  to  the  data  so  that  the  performance  of 
the  techniques  may  be  compared.  Note,  however,  that  in  the  case 
of  GP,  multi-objective  optimization  applies  pressure  toward  model 
parsimony  continuously  over  the  course  of  the  evolution  of  a 
population  of  models.  In  the  case  of  BT,  the  selection  set  method 
was applied once to a set of models after they had been generated.

5.  Experiments:  descriptions  and  results 

In  this  section  we  describe  the  experiments  we  conducted  and 
report  the  results. 

5.1. Experiment Set I

In Experiment Set I measurements from snow courses provided
ground-truth SWE data. We developed models to predict snow
course SWE at eight different sites in California where snow courses
had been conducted (Table 1). Three sites (CAP, GRZ, KTL) are
located at snow pillows but are not near any NCDC weather stations.
Three sites (NTH, SPD, MSH) are near NCDC stations but
are not at snow pillows. Two of the snow course sites
(HYS and HIG) are located at snow pillows and are also near
NCDC stations.

First, we conducted experiments at sites with snow pillows but
without weather stations (CAP, GRZ, KTL). These experiments
explored how well linear and nonlinear models predict snow
course-derived ground truth SWE using only snow pillow measurements.
Inputs to the models were pillow SWE and TOY. At each site
we developed models with three combinations of input variables:
TOY alone, pillow SWE alone, and TOY combined with pillow SWE.
In each case, we compared the performance of GP, LR, and BM.

Table 5
Experiment Set I summary.

Experiment  Model inputs            Locations
a           air temp.               MSH, NTH, SPD, HIG, HYS
b           TOY                     all
c           pillow                  CAP, GRZ, KTL, HIG, HYS
d           air temp., TOY          MSH, NTH, SPD, HIG, HYS
e           air temp., pillow       HIG, HYS
f           TOY, pillow             CAP, GRZ, KTL, HIG, HYS
g           air temp., TOY, pillow  HIG, HYS

Second, we conducted experiments at sites near weather stations
but without snow pillows (MSH, NTH, SPD). These experiments
explored how well linear and nonlinear models predict
snow course-derived ground truth SWE using air temperature data
without access to snow pillow SWE measurements. Inputs to the
models were air temperature and TOY. At each site we developed
models with three combinations of input variables: temperature
alone, TOY alone, and temperature combined with TOY. In each
case, we compared the performance of GP to LR.


Third, we conducted experiments at sites that are near weather
stations and have snow pillows (WIG, HYS). These experiments
explored how well linear and nonlinear models predict snow
course-derived ground truth SWE using both pillow SWE measurements
and air temperature data. Inputs to the models were pillow SWE,
air temperature, and TOY. At each site we developed models with seven
unique combinations of input variables: temperature alone, TOY
alone, pillow SWE alone, temperature and TOY together, temperature
and pillow SWE together, TOY and pillow SWE together,
and, finally, temperature, TOY, and pillow SWE together.

Table  5  summarizes  Experiment  Set  I.  Each  experiment  was 
repeated  30  times  to  generate  error  samples  for  each  method. 
Figs.  6-9  plot  the  mean  values  of  the  samples.  Error  bars  indicate 
95% confidence intervals, i.e. sample mean ± (SEM × 1.96). GP
and  LR  had  similar  error,  but  both  had  lower  error  than  BM  with 
p-value  less  than  0.001  in  all  cases. 
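The error bars described above can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' code: the function name `mean_ci95` and the sample values are hypothetical, and the interval uses the same normal approximation stated in the text (sample mean ± 1.96 × SEM).

```python
import math
import statistics

def mean_ci95(samples):
    """Return (mean, half_width) for the interval mean +/- 1.96 * SEM."""
    n = len(samples)
    mean = statistics.fmean(samples)
    # SEM = sample standard deviation (n - 1 denominator) / sqrt(n)
    sem = statistics.stdev(samples) / math.sqrt(n)
    return mean, 1.96 * sem

# Hypothetical error samples (mm) from repeated runs of one method.
errors = [102.0, 98.5, 110.2, 95.1, 104.7, 99.9]
mean, half_width = mean_ci95(errors)
print(f"{mean:.1f} +/- {half_width:.1f} mm")
```

With 30 repetitions per experiment, as used here, the factor 1.96 is a reasonable approximation; a t-based multiplier would be slightly wider.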

The mean ground truth SWE value in mm at each site was:
CAP: 1145, GHZ: 1256, KTC: 687, MSH: 1747, NTH: 337, SPD: 697,
WIG: 594, HYS: 1065.

5.2.  Experiment  Set  II 

In Experiment Set II models predicted HS instead of SWE. While
research on the influence of meteorological factors on snowpack
distribution is extensive (Logan, 1973; Elder et al., 1991;
Schmucki et al., 2014; Hock and Noetzli, 1997), the inclusion of
meteorological inputs does not always improve snowpack model
performance (Moeser, 2010), and the inclusion of air temperature
data did not improve model performance in Experiment Set I.
Therefore, in Experiment Set II we focus on TOY and single-point
HS measurements as predictors of mean catchment HS. Instead of
manual snow course data as in Experiment Set I, ground-truth data
are derived from HS measurements collected by the Snowcloud
WSN. We compared the performance of three machine learning
techniques: LR, BT, and GP.

Fig. 6. Experiment Set I results: CAP, GHZ, and KTC.

Fig. 7. Experiment Set I results: (a) MSH, (b) NTH, and (c) SPD.

Fig. 8. Experiment Set I results: WIG.

Fig. 9. Experiment Set I results: HYS.

We developed estimators to predict HS at two sites: Sulitjelma,
Norway and the Sagehen Experimental Forest, California. At
Sulitjelma, model inputs were combinations of HS at Balvatn and
TOY. At Sagehen, model inputs were combinations of HS at HYS,
HS at TDC, and TOY. Table 6 summarizes Experiment Set II. We
ran each experiment under four data-division schemes (Random
Division, 4 Bins, 3 Bins, 2 Bins) and repeated each of these 30
times to generate error samples.

Table 6
Experiment Set II summary.

Experiment   Location               Model inputs
a            Sulitjelma, Norway     TOY
b            Sulitjelma, Norway     HS at Balvatn
c            Sulitjelma, Norway     HS at Balvatn, TOY
d            Sagehen, California    TOY
e            Sagehen, California    HS at HYS
f            Sagehen, California    HS at TDC
g            Sagehen, California    HS at HYS, TOY
h            Sagehen, California    HS at TDC, TOY

Fig. 10. Experiment Set II (random division) model error.

Fig. 11. Experiment Set II (four bins) model error.

Figs. 10-13 plot the mean values of the samples, i.e. the error of
the modeling techniques on testing data. Error bars indicate 95%
confidence intervals, i.e. sample mean ± (SEM × 1.96). Stars indicate
p-values for the Student's paired t-test with the null hypothesis
that GP does not have lower error than BT, so a small p-value is
evidence that GP outperforms BT. One star, *, indicates that p is less than
0.05, ** indicates that p is less than 0.01, and *** indicates that p is
less than 0.001. Similarly, plus signs indicate p-values for the
null hypothesis that GP does not have lower error than LR. One plus
sign, +, indicates that p is less than 0.05, and ++ indicates that p
is less than 0.01. The mean ground truth HS value at Sulitjelma was
1.1900 m. The mean ground truth HS value at Sagehen was 0.728 m.
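The star notation can be illustrated with a short sketch of a one-sided paired t-test. It is not the authors' implementation: the function names are hypothetical, and rather than computing exact p-values it compares the t statistic against standard one-sided critical values for 29 degrees of freedom, which assumes exactly 30 paired runs per experiment.

```python
import math
import statistics

# One-sided critical values of Student's t for 29 degrees of freedom
# (30 paired runs), taken from a standard t table.
CRITICAL_T_DF29 = [(3.396, "***"), (2.462, "**"), (1.699, "*")]

def paired_t_statistic(errors_a, errors_b):
    """t statistic of the paired differences errors_b - errors_a."""
    diffs = [b - a for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

def stars(errors_gp, errors_other):
    """Marker for rejecting 'GP does not have lower error' (assumes 30 pairs)."""
    if len(errors_gp) != 30 or len(errors_other) != 30:
        raise ValueError("critical values above assume exactly 30 paired runs")
    t = paired_t_statistic(errors_gp, errors_other)
    for critical, marker in CRITICAL_T_DF29:
        if t > critical:
            return marker
    return ""
```

The same function applies to the plus-sign comparison against LR; only the second argument changes.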

Fig. 12. Experiment Set II (three bins) model error.

Fig. 13. Experiment Set II (two bins) model error.

Fig. 14. Experiment Set II (random division) model size.

Fig. 15. Experiment Set II (four bins) model size.

Fig. 16. Experiment Set II (three bins) model size.

Figs. 14-17 plot the mean sizes of the models whose performance
is reported in Figs. 10-13. In the case of GP and BT, these
are the models selected using the selection set method. For GP,
model size is the number of nodes in the GP tree. For BT, model size
is the number of nodes in the binary tree. For LR, model size is the
number of operators and values, specifically 5 in the case of a single
independent variable and 9 in the case of two independent
variables. Stars indicate p-values for the Student's paired t-test
with the null hypothesis that GP models are not smaller than BT models.
One star, *, indicates that p is less than 0.05, ** indicates that p is
less than 0.01, and *** indicates that p is less than 0.001.
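The LR model sizes of 5 and 9 follow from counting operators and values in the fitted linear expression. A minimal sketch, assuming expressions are written as nested tuples (a representation chosen here for illustration only):

```python
def count_nodes(tree):
    """Count operators and leaves in an expression tree of nested tuples."""
    if isinstance(tree, tuple):  # (operator, child, child, ...)
        return 1 + sum(count_nodes(child) for child in tree[1:])
    return 1  # a variable name or a constant

# Linear models as trees; "a", "b", "c" stand for fitted coefficients.
lr_one_input = ("+", ("*", "a", "x1"), "b")
lr_two_inputs = ("+", ("+", ("*", "a", "x1"), ("*", "b", "x2")), "c")

print(count_nodes(lr_one_input))   # 5
print(count_nodes(lr_two_inputs))  # 9
```

The same counting rule gives the size of a GP tree, which makes the three methods comparable on one axis.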

6.  Discussion 

In  this  section  we  discuss  the  results  of  our  experiments,  offer 
some  hypotheses  to  explain  our  findings,  and  suggest  possible  next 
steps  for  continued  research. 


Fig.  17.  Experiment  Set  II  (two  bins)  model  size. 


6.1.  Experiment  Set  I 

In  Experiment  Set  I  GP  performed  at  least  as  well  as  other  meth¬ 
ods  in  all  experiments.  This  result  was  expected  because  GP  is  cap¬ 
able  of  generating  the  same  models  as  LR  and  BM.  We  did  not 
perform  hypothesis  tests  comparing  GP  with  LR  because  visual 
inspection of error means and 95% confidence intervals (Figs. 6-9)
suggests that the methods performed similarly. At the sites
where  a  snow  pillow  was  present,  the  performance  of  BM  was 
evaluated.  At  all  of  these  sites,  in  all  of  the  experiments  where  pil¬ 
low  SWE  was  an  input  variable  (b,  c,  f),  both  LR  and  GP  performed 
better  (p-value  less  than  0.001)  than  BM. 

These  results  suggest  that  machine  learning  techniques  can  be 
used  to  develop  models  that  predict  mean  catchment  SWE  more 
accurately than BM. In general, models performed better when
snow pillow data were included. However, GP did not outperform
LR.

Because LR performed as well as GP in Experiment Set I, we suspected
strict linearity among the explanatory relationships in the
data. We hypothesize that because snow courses measure SWE
only at a single location, they failed to capture existing nonlinearities,
and that even though the relationships underlying snowpack
distribution are nonlinear, our Experiment Set I data is
linear. We therefore did not further pursue nonlinear modeling,
such as BT, in Experiment Set I.

6.2.  Experiment  Set  II 

First we conducted Experiment Set II: Random Division. GP outperformed
LR in every experiment except in Norway when the only
model input was HS at Balvatn. In every experiment in California
where TOY was an input, BT had much lower error than either GP
or LR. In all experiments where TOY was an input, the resulting
BT models were very large. GP also had lower error and larger
model sizes when TOY was used than when TOY was not used.
We had originally introduced the TOY variable to allow models to
distinguish different parts of the season. However, we hypothesized
that BT, and to a lesser extent GP, were abusing the TOY variable
to memorize snow data by mapping TOY data to ground truth
HS. Even though training and testing data were technically distinct,
many of the samples in the testing data were temporally or spatially
proximal to samples in the training data. The testing data were
not truly unseen with respect to the TOY variable. Even though
models generalized well to the testing data, they were over-fitting
to the TOY variable and would likely not generalize to truly unseen
data, e.g. from another snow season.

To test this hypothesis and address the possible problem of
over-fitting to the TOY variable, we repeated Experiment Set II
three more times. In Experiment Set II: 4 Bins, 3 Bins, and 2 Bins,
we successively decreased the temporal overlap between training
and testing data and increased the coarseness of the temporal
granularity of the division into training and testing data. Proceeding
through this sequence, it became more difficult for machine learning
to memorize HS data by over-fitting to the TOY variable. At the
same time, BT error increased and the performance of GP with
respect to BT improved. These results suggest that GP is more resilient
against over-fitting than BT, possibly as a result of multi-objective
optimization. Furthermore, when the ability of machine
learning to exploit the TOY variable by memorizing the HS data
was minimized, GP significantly outperformed both LR and BT.
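The effect of coarser temporal division can be sketched in code. This is one plausible reading of the binning idea, not the procedure used in the experiments: it assumes samples carry a time-of-year field (here the hypothetical key `"toy"`), that the season is cut into contiguous bins, and that alternating bins go to training and testing. All function names are illustrative.

```python
import random

def random_division(samples, test_fraction=0.5, seed=0):
    """Shuffle samples and split them, ignoring time order."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def binned_division(samples, n_bins):
    """Cut time-ordered samples into contiguous bins; alternate bins
    between training and testing (an assumed assignment rule)."""
    ordered = sorted(samples, key=lambda s: s["toy"])
    size = -(-len(ordered) // n_bins)  # ceiling division
    train, test = [], []
    for b in range(n_bins):
        block = ordered[b * size:(b + 1) * size]
        (train if b % 2 == 0 else test).extend(block)
    return train, test

def mean_gap_to_training(train, test):
    """Mean time distance from each test sample to its nearest training sample."""
    return sum(min(abs(t["toy"] - r["toy"]) for r in train) for t in test) / len(test)

samples = [{"toy": day, "hs": 0.0} for day in range(120)]  # placeholder HS values
for n_bins in (4, 2):
    train, test = binned_division(samples, n_bins)
    print(n_bins, mean_gap_to_training(train, test))
```

Under these assumptions, fewer bins place test samples farther in time from any training sample, so a model that maps TOY directly to HS has less to memorize.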

6.3.  Future  work 

We believe that the preliminary results discussed in this work
are promising and warrant further research into the applicability
of GP to snowpack modeling.

This  work  should  be  expanded  into  a  multi-year  study. 
Although Experiment Set I used snow course data collected over several
years, Snowcloud data used in Experiment Set II was limited to a single
snow season. A multi-year study would allow models trained
on  Snowcloud  data  during  one  or  several  years  to  be  evaluated 
on  unseen  data  from  another  year.  Models  trained  on  multi-year 
data  may  be  more  robust  to  application  in  future  years  than  are 
models  trained  on  single-year  data.  Even  without  collecting  more 
data,  Experiment  Set  I  could  be  modified  so  that  models  are  trained 
on  data  from  earlier  years  and  tested  on  data  from  later  years. 

Beyond  those  discussed  here,  there  are  many  machine  learning 
techniques that should be applied to the problem of catchment-scale
SWE estimation. GP possesses a unique combination of desirable
qualities, but its performance should be compared against
other  methods  such  as  ANNs,  nonlinear  multiple  regression,  and 
FFX  (McConaghy,  2011),  a  non-evolutionary  symbolic  regression 
technology. 

The  only  meteorological  input  to  our  models  was  air  tem¬ 
perature.  However,  meteorological  data  involving  wind,  solar 
radiation,  humidity,  etc.  are  available  for  many  locations  and  have 
been  shown  to  influence  snow  distribution  (Logan,  1973;  Elder 
et  al.,  1991;  Schmucki  et  al.,  2014;  Hock  and  Noetzli,  1997). 
Future  work  should  incorporate  more  potential  meteorological 
predictors  of  SWE  and  HS. 

Topographic  features  significantly  shape  snow  distribution,  and 
models  of  this  relationship  have  been  developed  and  used  exten¬ 
sively  (Winstral  et  al.,  2013;  Marofi  et  al.,  2011;  Chang  and  Li, 
2000; Tabari et al., 2010; Anderton et al., 2004; Grünewald et al.,
2013; Molotch et al., 2005; Elder et al., 1998). Although topographic
data was not an explicit input in our experiments, models
developed with our techniques that use input variables to predict
distributed  snow  measurements  likely  express  some  of  the  rela¬ 
tionship  between  topography  and  snowpack  distribution. 
Previous  efforts  to  model  snowpack  using  topographic  data  have 
derived  explicit  model  inputs  from  DEMs.  The  possibility  that  GP 
could  play  an  active  role  in  determining  which  topographical  fea¬ 
tures  to  use  should  be  explored.  GP  might  discover  new  methods 
for  extracting  information  from  DEMs  that  is  predictive  of  snow¬ 
pack  distribution.  It  is  possible  that  machine  learning  could  use 
topographic and other data to produce non-site-specific models,
which are trained on data from one or more sites and then applied
to  other  sites. 

Schwaerzel  and  Bylander  (2006)  developed  high-order  statisti¬ 
cal  functions  for  GP  to  model  financial  data.  These  allowed  GP 
models  to  dynamically  select  and  aggregate  a  slice  of  time  series 
data.  Future  work  should  apply  these  techniques  to  allow  GP  to 
determine  how  to  select  and  aggregate  meteorological  and  topo¬ 
graphic  data.  We  made  air  temperature  available  to  GP  by  means 
of  functions  that  aggregate  daily  measurements  over  an  arbitrary 
seven-day window. Instead, GP could inductively discover how
models  should  dynamically  select  and  aggregate  a  section  of  time 
series  data  according  to  changing  circumstances. 
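A windowed aggregation of this kind might look like the following sketch. The function name, its parameters, and the temperature values are hypothetical; the point is only that the window length and the aggregate are arguments that a GP system could, in principle, choose for itself instead of having them fixed at seven days.

```python
import statistics

def window_aggregate(series, end, length=7, agg=statistics.fmean):
    """Aggregate the `length` values ending at index `end` (inclusive).
    Windows are truncated at the start of the series."""
    start = max(0, end - length + 1)
    return agg(series[start:end + 1])

daily_temp = [-3.0, -1.5, 0.5, 2.0, 1.0, -0.5, -2.0, 3.5, 4.0]  # hypothetical deg C
print(window_aggregate(daily_temp, end=8))            # seven-day mean
print(window_aggregate(daily_temp, end=8, agg=max))   # seven-day maximum
print(window_aggregate(daily_temp, end=8, length=3))  # a shorter window
```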


7.  Conclusion 

In  this  paper  we  have  described  novel,  low-cost  methods  for 
catchment-scale  SWE  estimation  using  machine  learning  algo¬ 
rithms.  The  commonly  used  method  of  estimating  catchment-scale 
SWE  from  a  single  point  measurement  is  error-prone  because  of 
the  spatial  heterogeneity  of  snowpack  distribution.  We  envision 
an  approach  wherein  short-term  field  campaigns  collect  ground- 
truth  data  for  generating  snowpack  models  which  can  subsequent¬ 
ly  augment  existing  NRT  snow  telemetry.  Toward  this  end,  we 
explored  a  suite  of  machine  learning  techniques  to  extrapolate 
estimates  of  mean  catchment  SWE  from  single  point  SWE  measure¬ 
ments  and  other  available  data  and  pursued  three  key  research 
directions.  First,  we  addressed  the  question  of  which  machine 
learning  approaches  are  best  for  this  problem.  Second,  we  dis¬ 
cussed  and  pursued  the  use  of  a  range  of  possible  input  para¬ 
meters.  Finally,  we  grappled  with  the  issue  of  ground-truthing 
given  limited  datasets. 

We  compared  the  performance  of  a  basic  method  (BM)  which 
assumes  no  spatial  variability  of  SWE,  linear  regression  (LR), 
Genetic  Programming  (GP),  and  binary  regression  trees  (BT).  We 
emphasize  GP  because  it  produces  nonlinear,  white-box  models 
without  requiring  assumptions  about  model  form.  GP  can  be  aug¬ 
mented  with  multi-objective  optimization  to  constrain  model 
complexity  and  mitigate  over-fitting.  We  found  that  machine 
learning  techniques  generally  outperformed  BM,  demonstrating 
the  spatial  variability  of  SWE.  Nonlinear  techniques  outperformed 
linear  models  in  Experiment  Set  II,  but  not  in  Experiment  Set  I,  sug¬ 
gesting  that  there  are  nonlinear  relationships  among  the  modeled 
data  used  in  Experiment  Set  II.  Snowpack  distribution  at  the  catch¬ 
ment  scale  has  been  shown  to  be  highly  nonlinear.  It  is  possible 
that  the  spatially  distributed  sampling  technique  (Snowcloud 
wireless  sensor  network)  used  for  ground-truthing  in  Experiment 
Set  II  captured  some  of  the  nonlinearity  of  snowpack  distribution, 
while  the  single-location  sampling  (manual  snow  courses)  used  for 
Experiment  Set  I  did  not. 

When  we  naively  divided  our  data  at  random  to  generate  train¬ 
ing  and  testing  data,  BT  had  much  lower  error  than  GP  in  experi¬ 
ments  where  time  of  year  (TOY)  was  an  input  variable.  In  these 
cases, BT models were much larger than GP models and we suspected
that they were memorizing data by mapping TOY to snow
depth. When we instead divided the data into more temporally
contiguous training and testing data in order to prevent this behavior,
BT model size decreased and GP outperformed BT.

We  emphasize  that  GP  can  flexibly  incorporate  new  predictors 
of  catchment-scale  SWE  into  the  models  generated,  augmenting 
its  capacity  to  extrapolate  estimates  of  mean  catchment-wide 
SWE  from  a  single  point  measurement.  Genetic  programming  will 
make  use  of  input  data  that  helps  explain  the  dependent  variable 
while  ignoring  data  that  does  not.  Our  choice  of  independent  vari¬ 
ables  was  a  result  of  intuitive  guesses  combined  with  constraints 
on  available  data.  Topographic  information  was  ruled  out  because 
we  were  unable  to  determine  the  precise  locations  of  snow  pillows. 
Multiple  forms  of  meteorological  data  were  available,  but  air  tem¬ 
perature  was  the  most  complete,  allowing  us  to  compose  datasets 
large  enough  for  effective  machine  learning.  However,  the  inclu¬ 
sion  of  air  temperature  did  not  have  a  significant  impact  on  model 
performance  in  our  first  experiment  set,  and  so  we  did  not  use  any 
meteorological  data  in  our  second  experiment  set. 

Because  it  has  been  shown  that  the  forcing  effects  underlying 
snowpack  distribution  change  over  the  course  of  a  snow  season, 
we  introduced  time  of  year  (TOY)  as  an  independent  variable  so 
that  models  can  distinguish  seasonal  differences.  However,  we 
found  that  nonlinear  models  used  TOY  to  memorize  the  data  by 
mapping  TOY  to  ground  truth  measurements  instead  of  expressing 
the  underlying  relationships  of  snowpack  distribution.  The  ideal 
solution  to  this  problem  would  be  a  multi-year  study  using  spatial¬ 
ly  distributed  data  collected  by  Snowcloud.  However,  given  the 
limitation  of  a  one  year  dataset,  we  modified  how  data  was  divided 
to  constrain  the  temporal  proximity  of  training  and  testing  data. 

We  conducted  two  sets  of  experiments,  using  available  data,  as 
successive  approximations  of  our  goal  of  near-real-time  catch¬ 
ment-scale  SWE  estimation.  When  ground  truth  was  obtained  from 
distributed  sampling  techniques  and  when  we  were  careful  to 
mitigate  overfitting  to  the  TOY  variable,  GP  outperformed  other 
techniques.

4.4 Kriegman et al. “Evolving spatially aggregated...” (2016).

A  technical  manuscript  describing  how  symbolic  regression  of  environmental  data  can  produce 
models interpretable by laypersons follows.

Abstract. Satellite imagery and remote sensing provide explanatory
variables  at  relatively  high  resolutions  for  modeling  geospatial  phenom¬ 
ena,  yet  regional  summaries  are  often  desirable  for  analysis  and  action¬ 
able  insight.  In  this  paper,  we  propose  a  novel  method  of  inducing  spatial 
aggregations  as  a  component  of  the  machine  learning  process,  yielding 
regional  model  features  whose  construction  is  driven  by  model  predic¬ 
tion  performance  rather  than  prior  assumptions.  Our  results  demonstrate 
that  Genetic  Programming  is  particularly  well  suited  to  this  type  of  fea¬ 
ture  construction  because  it  can  automatically  synthesize  appropriate 
aggregations,  as  well  as  better  incorporate  them  into  predictive  models 
compared  to  other  regression  methods  we  tested.  In  our  experiments  we 
consider  a  specific  problem  instance  and  real-world  dataset  relevant  to 
predicting  snow  properties  in  high-mountain  Asia. 

Keywords:  spatial  aggregation,  feature  construction,  genetic  program¬ 
ming,  symbolic  regression 


1  Introduction 

Regional modeling focuses on explaining phenomena occurring at regional, as
opposed to site-specific or global, scales [11]. Regional models are of interest in
many  remote  sensing  applications,  as  they  provide  meaningful  units  for  analysis 
and  actionable  insight  to  policymakers.  Yet  satellite  imagery  and  remote  sens¬ 
ing  provide  variables  at  relatively  high  resolutions.  Consequently,  studies  often 
involve  decisions  concerning  how  to  integrate  this  information  in  order  to  model 
regional  processes.  Considering  measurements  at  each  individual  spatial  unit  as 
a  separate  model  feature  can  result  in  a  high  dimensional  problem  in  which  high 
variance  and  overfitting  are  major  concerns.  For  this  reason,  spatial  aggregation 
is  often  applied  in  this  setting  to  uniformly  up-sample  variables  to  be  consistent 
with the response. However, in averaging variables across all spatial units in the
region, we discard information, which could in turn diminish prediction accuracy
and our understanding of underlying phenomena.

Rather  than  strictly  incorporating  individual  spatial  units  or  uniformly  up- 
sampling,  it  might  instead  be  beneficial  to  construct  features  of  a  regional  model 
using particularly important subsets of geographical space. In this paper, we
move away from uniform up-sampling aggregations towards more flexible and interesting
aggregation operations predicated on their subsequent use as features
of  a  regional  model.  We  propose  a  novel  method  of  inducing  spatial  aggrega¬ 
tions  as  a  component  of  the  machine  learning  process,  yielding  features  whose 
construction  is  driven  by  model  performance  rather  than  prior  assumptions. 

In  experiments  designed  to  explore  these  techniques,  we  consider  a  specific 
problem  and  real  dataset:  estimating  regional  Snow  Water  Equivalent  (SWE) 
in  high-mountain  Asia  with  satellite  imagery.  Improved  estimation  of  SWE  in 
mountainous  regions  is  critical  [3]  but  is  difficult  due  in  part  to  complex  charac¬ 
teristics  of  snow  distribution  [2]. 


2  Methods 

We take a comparative approach to the SWE problem, considering ridge regression,
lasso, and GP-based symbolic regression (footnote 1). For each regression model, we
consider  a  filter-based  method  of  feature  construction  in  addition  to  a  second, 
more  dynamic  method.  For  linear  regression,  we  incorporate  a  wrapper  approach 
in  which  constructed  features  and  the  regression  model  are  induced  in  separate 
learning  processes,  with  feedback  between  the  two.  For  symbolic  regression,  we 
use  an  embedded  approach  where  constructed  features  and  the  regression  model 
are  induced  simultaneously  over  the  course  of  an  evolutionary  run. 


The Dataset. The SWE dataset (footnote 2) is derived from data collected by NASA's
Advanced Microwave Scanning Radiometer (AMSR2/E) and Moderate Resolution
Imaging Spectroradiometer (MODIS) for March 1 - September 30, in 2003 - 2011,
over an area that spans most of high-mountain Asia. We have three explanatory
variables measured daily across a 113 × 113 regular grid for 1935 days: (1)
mean and (2) standard deviation of sub-pixel Snow Covered Area [4,10], as well
as (3) an estimate of SWE derived from passive microwaves [15]. Our response
variable is regional SWE, an attribute of the entire study region, represented
as a single value for each of the 1935 days. The response was “reconstructed”
by combining the snow cover depletion record with a calculation of the melt rate to
retroactively estimate how much snow had existed in the region [9].


2.1  Regression  Models 

Ridge regression [5] is similar to ordinary least squares (OLS) but subject to a
bound on the L2-norm of the coefficients. Because of the nature of its quadratic
constraint, ridge regression cannot produce coefficients exactly equal to zero
and keeps all of the features in its model. Lasso (Least Absolute Shrinkage and
Selection Operator, [16]) modifies the ridge penalty and is subject to a bound
on the L1-norm of the coefficients. The geometry of this L1-penalty has a strong
tendency to produce sparse solutions with coefficients exactly equal to zero. In
many high dimensional settings, lasso is the state-of-the-art regression method
given its ability to produce parsimonious models with excellent generalization
performance. For both lasso and ridge regression, the parameter constraining
the coefficients is set through cross-validation.

1 The source code necessary for reproducing our results is available at
https://github.com/skriegman/ppsn_2016.

2 Raw satellite data was pre-processed by Dr. Jeff Dozier (UCSB) using previously
reported techniques and is available upon request.
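In constrained form, the ridge and lasso problems differ only in the norm that is bounded. The notation below is assumed for illustration and does not appear in the original text: $y_i$ is the response for observation $i$, $x_{ij}$ the value of feature $j$, $\beta_j$ the coefficients, and $t \ge 0$ the bound chosen by cross-validation.

```latex
\begin{align}
\hat{\beta}^{\mathrm{ridge}} &= \arg\min_{\beta}
  \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Big)^{2}
  \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^{2} \le t, \\
\hat{\beta}^{\mathrm{lasso}} &= \arg\min_{\beta}
  \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Big)^{2}
  \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t.
\end{align}
```

The L1 constraint region has corners on the coordinate axes, which is why lasso solutions tend to set some coefficients exactly to zero, while the smooth L2 region only shrinks them.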

Genetic  Programming  (GP,  [7])  is  a  very  flexible  heuristic  technique  which 
can  conveniently  represent  free-form  mathematical  equations  (candidate  regres¬ 
sion  models)  as  parse  trees.  GP’s  inherent  flexibility  is  well-suited  for  our  particu¬ 
lar  problem  because  it  can  efficiently  express  spatial  aggregations  and  seamlessly 
combine  them  into  the  learning  process  with  minimal  assumptions.  Furthermore, 
the “white box” nature of GP may provide physical insights about this complex
problem that are currently lacking, as in other domains [1, 13].

To  search  the  space  of  possible  GP  trees  we  use  a  variant  of  Age-Fitness 
Pareto  Optimization  (AFPO,  [12]).  AFPO  is  a  multiobjective  method  that  re¬ 
lies  on  the  concept  of  genotypic  age,  an  attribute  intended  to  preserve  diversity. 
We  extend  AFPO  to  include  an  additional  objective  of  model  size,  defined  as  the 
syntactic  length  of  an  individual  tree.  The  size  attribute  protects  parsimonious 
models  which  are  less  prone  to  overfitting  the  training  data.  The  GP  algorithm 
therefore  identifies  the  Pareto  front  using  three  objectives  (all  minimized):  age, 
error (fitness), and size. For the fitness objective, we use a correlation-based
function rather than pure error, and define $f_{COR} = 1 - |\phi(\hat{s}, s)|$, where
$\phi(\hat{s}, s)$ denotes Pearson correlation between model predictions ($\hat{s}$)
and actual values of our response ($s$), regional SWE. Correlation has recently been
shown to outperform error-based search drivers: a model that makes a systematic
error can be corrected easily by linearly scaling its output, and therefore should be
protected [14]. Accordingly, for all GP implementations, we apply a linear
transformation after $f_{COR}$-driven evolution has concluded, by using an individual
program (model) output as the single input of OLS on the training data.
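The correlation-based fitness and the subsequent linear scaling can be sketched as follows. This is an illustrative implementation, not the code from the authors' repository: the function names are hypothetical, and treating a constant input as uncorrelated is an assumption made here to avoid division by zero.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 for constant input (assumption)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def f_cor(predictions, actual):
    """Fitness to minimize: 1 - |Pearson correlation|."""
    return 1.0 - abs(pearson(predictions, actual))

def linear_scale(predictions, actual):
    """OLS fit of actual = a + b * predictions; returns (a, b)."""
    n = len(predictions)
    mp, ma = sum(predictions) / n, sum(actual) / n
    cov = sum((p - mp) * (y - ma) for p, y in zip(predictions, actual))
    var = sum((p - mp) ** 2 for p in predictions)
    b = cov / var
    return ma - b * mp, b

actual = [10.0, 20.0, 30.0, 40.0]
raw = [-1.0, -2.0, -3.0, -4.0]  # wrong scale and sign, but perfectly correlated
print(f_cor(raw, actual))
a, b = linear_scale(raw, actual)
print([a + b * p for p in raw])
```

The example shows why the absolute value is used: a model whose output is systematically off in scale, offset, or even sign still receives the best possible fitness, because the final OLS step removes that systematic error.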

Our implemented GP experiments used ramped half-and-half initialization
with a height range of 2-6 and an instruction set including unary ({sin, cos, log,
exp}) and binary functions ({×, +, -, /}). One thousand individuals in the population
are subject to crossover (with probability 0.75) and mutation (with probability
0.01) over the course of 1000 generations. There is a static limit on the
tree height (17) as well as the tree size (300 nodes). Each experiment consists
of 30 evolutionary runs, from which the best model (lowest training $f_{COR}$) is
selected. The selected model is then transformed using OLS, and subsequently
validated using unseen test data.

Standard  Methods.  Ridge  regression,  lasso,  and  GP  may  be  performed  on 
the  raw  data  using  each  variable  at  each  individual  spatial  unit  as  a  separate 
feature.  We  denote  these  methods  as  Standard  Ridge  (SR),  Standard  Lasso  (SL) 
and Standard GP (SGP). SR, SL and SGP each have access to 113 × 113 × 3 =
38307 features, but only 1720 observations in each fold of data.

2.2 Feature Construction Methods

Feature  construction  is  a  well  studied  problem  and  the  utility  of  genetic  program¬ 
ming  for  feature  construction  has  been  recognized  in  many  previous  studies  [8]. 
The  key  difference  in  our  work  from  this  past  work  is  the  nature  of  the  data 
being  modeled.  We  presume  that  there  exist  spatial  autocorrelations  of  varying 
size  and  shape  that,  if  aggregated  to  improve  the  signal  to  noise  ratio,  yield 
features  supporting  more  accurate  predictions. 

In  a  regional  model,  we  can  construct  features  by  aggregating  higher  di¬ 
mensional  variables  across  space.  However,  it  is  not  entirely  clear  what  kind  of 
aggregations  are  useful  as  features  of  a  predictive  model.  Grouping  variables 
based  on  similarity  or  dissimilarity  does  not  necessarily  produce  useful  regional 
features.  In  this  paper,  we  make  an  assumption  about  the  importance  of  distance 
and  continuity  in  effective  spatial  aggregations,  based  on  Tobler’s  first  law  of  ge¬ 
ography  [17]  which  states  that  “everything  is  related  to  everything  else,  but  near 
things  are  more  related  than  distant  things.”  Accordingly,  we  limit  the  space  of 
possible  spatial  aggregations  to  be  an  average  of  values  within  a  circular  spatial 
area  defined  by  its  centerpoint  and  radius.  However,  where  to  aggregate,  how 
many  aggregations  to  perform,  and  how  to  combine  the  aggregates  must  still  be 
determined  manually  or  decided  during  model  optimization.  We  view  filters  and 
wrappers  as  intermediary  steps  in  relaxing  assumptions  towards  our  embedded 
approach,  which  automates  all  three  of  these  aspects. 


The Filter Method. Filter-based feature construction methods transform or
“filter” the original variables as a preprocessing step, prior to modeling. Our
filter for the SWE problem represents a static up-sampling transformation of the
original variables. Each variable is decomposed in space by a grid of overlapping
circles (footnote 3) of equal radii centered on a square lattice pattern of
points (see Figure 1a,c,e for example). Each constructed feature corresponds to
the average (arithmetic mean) of a particular variable sampled within a
particular circle of space. Units that reside in an overlapping region of two
separate circles are included in the calculation of both features. Since there
are three explanatory variables in the SWE dataset, an R x R grid corresponds to
p = 3R^2 constructed features. The constructed features are then used as inputs
for ridge regression, lasso, and GP, which we will refer to as Filtered Ridge
(FR), Filtered Lasso (FL), and Filtered GP (FGP). We will also specify the value
of R used in a particular model instance as a subscript, e.g. FR_15 denotes
Filtered Ridge with R = 15. We consider filters with R ∈ {1, 2, ..., 20}; note,
however, that the standard methods are essentially filters with R = 113, albeit
with non-overlapping square pixels.
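
The filter construction can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it treats the raster as a flat grid with Euclidean distances (the circles in the text lie on the Earth's surface), and the function name `filter_features`, the placement of lattice centers, and the choice of radius are all hypothetical.

```python
import math

def filter_features(raster, R, radius):
    """Average a 2-D raster within R x R circles centered on a square lattice.

    raster: list of rows (lists of floats); radius is in cell units.
    Returns R*R means in row-major order (None where a circle covers no cell).
    """
    n_rows, n_cols = len(raster), len(raster[0])
    features = []
    for gi in range(R):
        for gj in range(R):
            # Assumed lattice: one center in the middle of each of the R x R blocks.
            cy = (gi + 0.5) * n_rows / R
            cx = (gj + 0.5) * n_cols / R
            total, count = 0.0, 0
            for i in range(n_rows):
                for j in range(n_cols):
                    # A cell belongs to the circle if its midpoint lies inside it;
                    # cells in overlapping circles contribute to every such circle.
                    if math.hypot(i + 0.5 - cy, j + 0.5 - cx) <= radius:
                        total += raster[i][j]
                        count += 1
            features.append(total / count if count else None)
    return features
```

Applying the function to each of the three explanatory variables and concatenating the results gives the 3R^2 features described above.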


The Wrapper Method. Wrapper-based feature construction methods incorporate
feedback from the fit of the model. We implement wrappers around both ridge
regression and lasso in order to enable the circular sampling regions to define
their own center and radius. The circles are no longer fixed on a grid with a
predetermined size. Instead, each constructed feature is uniquely parameterized
by the coordinates of a center unit (x, y), as a latitude and longitude tuple,
and a radius r, as a single floating-point value in km. The center can be any
spatial unit in the region, including one at the edge of the raster. The radius
is restricted to lie between 0 and 1000 km, which is flexible enough to contain
only a single unit or span the entire region (see Figure 1b,d for example).

3 The circles are in reality so-called “small circles,” as they lie on the
surface of the earth.

Wrapped Ridge (WR) and Wrapped Lasso (WL) separately use a ridge/lasso-driven
hill climbing algorithm to construct features that minimize Mean Absolute
Error (MAE), i.e. $\frac{1}{n}\sum_{i=1}^{n} |s_i - \hat{s}_i|$, where $s_i$ is
the actual value of our response (regional SWE) and $\hat{s}_i$ is the output
predicted by the model over n observations. The algorithm uses the same number
of circles for each of the three variables, initializing their parameters
(x, y, r) randomly. For 1000 iterations, a single constructed feature (circle) is
randomly selected and subjected to a Gaussian mutation on one of its parameters,
centered at zero with standard deviation equal to 25% of the radius. A new
ridge/lasso model is then refit on the mutated set of features using a random
subset of data sampled without replacement. If the mutation lowers model error
on the complementary set of training data that was left out, then the change is
accepted. Otherwise, the mutation is undone. If a proposed mutation to the radius
would take it outside the restricted range of 0-1000 km, then it is
“bounced back” by the distance it would have exceeded the boundary. For example,
a random mutation that would result in a radius of 1200 km becomes
1000 - (1200 - 1000) = 800 km. Thirty restarts are used, from which the best
model based on training data is selected. We consider R ∈ {1, 2, 3, 4} for
wrappers, corresponding to 3R^2 features, which really means 3 x 3R^2 modifiable
parameters (three per feature).
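
The bounce-back rule and the accept-if-better loop can be sketched as below. This is a simplified illustration, not the authors' code: only radii are mutated, and `error_fn` is a hypothetical stand-in for the step in which a ridge/lasso model is refit on a random subset and scored on the held-out remainder.

```python
import random

R_MIN, R_MAX = 0.0, 1000.0  # km, the radius range stated in the text

def bounce_back(r, lo=R_MIN, hi=R_MAX):
    """Reflect a proposed radius back inside [lo, hi]."""
    # Loop in case a very large step overshoots the opposite boundary as well.
    while r < lo or r > hi:
        r = 2 * hi - r if r > hi else 2 * lo - r
    return r

def mutate_radius(r, rng):
    """Gaussian mutation: mean zero, standard deviation 25% of the current radius."""
    return bounce_back(r + rng.gauss(0.0, 0.25 * r))

def hill_climb(radii, error_fn, iterations, rng):
    """Mutate one randomly chosen radius per iteration; keep it only if error drops."""
    best = list(radii)
    best_err = error_fn(best)
    for _ in range(iterations):
        candidate = list(best)
        k = rng.randrange(len(candidate))
        candidate[k] = mutate_radius(candidate[k], rng)
        err = error_fn(candidate)
        if err < best_err:          # accept the mutation
            best, best_err = candidate, err
        # otherwise the mutation is discarded (undone)
    return best, best_err
```

With this rule, a proposed radius of 1200 km is reflected to 800 km, matching the example in the text. In the method described above the error is re-estimated on a fresh random split at every step, so the comparison is noisier than in this deterministic sketch.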


The  Embedded  Method.  By  using  GP,  we  can  allow  for  flexibility  with 
respect  to  the  placement  and  number  of  aggregations  as  well  as  the  way  in 
which  they  are  combined  to  form  a  model.  However,  stochastic  optimization 
methods  like  GP  cannot  be  easily  “refit”  in  the  same  manner  as  deterministic 
algorithms like ridge regression or lasso. Therefore, using a wrapper approach for
GP is computationally infeasible. Instead, modifications to aggregated features
are  implemented  through  mutation-based  operators. 

In Genetic Programming with Embedded Spatial Aggregation (GPESA), introduced
here, our constructed features are represented as parameterized tree terminals,
with parameters (x, y, r). Constructed features are randomly initialized in the
same manner as the wrapper method, but separately for each terminal of each
individual in the population. Greedy Gaussian mutations to the parameters
(x, y, r) of a randomly selected constructed feature occur in the population
with 20% probability each generation. Mutations to r have mean zero and a
standard deviation of 0.25r, subject to the bounce-back rule. Similarly,
mutations to (x, y) have mean distance zero and a standard deviation of 0.25r.
For 25 iterations, greedy mutations modify the parameterized terminals within a
particular GP tree. A modification is accepted if it successfully reduces average
error (f_COR) on random subsets of training data sampled with replacement.
Aside from the stochastic application, another key difference between the
wrapper method’s hill climbing algorithm and GPESA’s greedy mutations is that
the overall regression model stays the same between mutations rather than being
“refit” after each mutation.


Validation.  In  order  to  validate  the  generalization  of  models  we  partition  the 
dataset  into  nine  overlapping  folds.  Each  fold  corresponds  to  leaving  out  one 
year  for  testing  and  training  on  the  remaining  eight  (using  years  2003  -  2011). 
We  use  MAE  on  the  unseen  test  data  as  a  metric  to  assess  model  performance. 
To  account  for  a  difference  in  scale  across  any  set  of  features,  all  input  model 
features  are  standardized  over  time  by  removing  the  mean  and  scaling  to  unit 
variance.  This  means  that  as  wrapper  and  embedded  methods  construct  new 
aggregations,  the  sampled  data  is  scaled  over  time  prior  to  being  averaged  over 
space.  Since  our  goal  is  near-real-time  estimation  for  a  future  day,  the  training 
values  of  a  feature’s  mean  and  variance  are  reapplied  when  scaling  the  same 
feature  in  validation. 
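
The fold structure and the reuse of training statistics at test time can be sketched as follows. This is an illustrative outline only; the function names are hypothetical, and the use of the population standard deviation is an assumption, since the text says only "unit variance".

```python
import statistics

def leave_one_year_out(years):
    """Yield (train_years, test_year) pairs, one fold per held-out year."""
    for test_year in years:
        yield [y for y in years if y != test_year], test_year

def fit_scaler(train_values):
    """Estimate mean and standard deviation from training data only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)  # assumption: population std
    return mean, (std if std > 0 else 1.0)  # guard against constant features

def apply_scaler(values, mean, std):
    """Standardize values with previously fitted (training) statistics."""
    return [(v - mean) / std for v in values]
```

For the years 2003 to 2011 this gives nine folds with eight training years each. The held-out year is scaled with `apply_scaler` using the mean and standard deviation fitted on the training years, so no test-year information enters the scaling.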


3  Results 


Table  1  displays  the  test  error  of  each  valid  regression  and  feature  construction 
method  combination.  For  filters  and  wrappers,  only  the  best  performing  model 
is  displayed  and  we  indicate  the  particular  value  of  parameter  R  as  a  subscript. 
Since  the  ultimate  goal  of  our  paper  is  to  synthesize  a  method  better  than 
existing  approaches,  we  must  statistically  compare  GPESA  to  SL,  the  state-of- 
the-art  linear  regression  /  variable  selection  algorithm.  The  null  hypothesis  of 
interest here is that of no difference between GPESA and SL. Therefore we
perform  yearly  Wilcoxon  signed  rank  tests  [6]  comparing  GPESA  to  SL  with 
Bonferroni  correction  across  the  nine  years.  For  five  out  of  the  nine  test  years, 
GPESA  is  significantly  better  than  SL,  while  for  the  other  four  years  there  is  no 
significant  difference  with  SL. 
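
As a small illustration of the correction step, the Bonferroni rule compares each yearly p-value against the significance level divided by the number of tests. The sketch below assumes the p-values have already been computed by a Wilcoxon signed rank test; the function name is hypothetical and no actual p-values from the study are reproduced.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one reject/keep flag per test under Bonferroni correction."""
    # Each test is held to alpha / m, where m is the number of tests.
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]
```

With nine yearly tests and a 0.05 level, each individual test must reach p < 0.05 / 9, roughly 0.0056, to be declared significant.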

By displaying only the best testing filters and wrappers, we aim to focus
speculation about GPESA performance through a conservative lens. Yet we
ultimately view filters and wrappers as intermediary steps “working up” to
GPESA. Accordingly, the best test error better represents a bound on the
potential performance of a particular intermediary method, even though it may
not be possible to achieve such performance through a parameter sweep based on
the training data. And indeed, across all methods tested, GPESA reported the
lowest recorded median mean-absolute error in all but two years (7 of 9); in
those two years it had the second lowest.
Year  SR    SL    SGP          FR_4  FL_19  FGP_19       WR_2         WL_3         GPESA
2003  0.86  0.51  0.35 (0.14)  0.50  0.46   0.44 (0.08)  0.43 (0.10)  0.49 (0.09)  0.29 (0.09)
2004  0.47  0.30  0.32 (0.10)  0.34  0.29   0.26 (0.05)  0.37 (0.16)  0.35 (0.16)  0.17 (0.05)
2005  0.95  0.44  0.50 (0.13)  0.61  0.40   0.52 (0.06)  0.58 (0.11)  0.63 (0.09)  0.32 (0.07)
2006  0.66  0.27  0.41 (0.29)  0.57  0.52   0.36 (0.06)  0.53 (0.11)  0.54 (0.11)  0.27 (0.05)
2007  0.72  0.33  0.44 (0.10)  0.42  0.38   0.34 (0.05)  0.52 (0.13)  0.50 (0.11)  0.24 (0.06)
2008  1.46  0.46  0.60 (0.13)  0.71  0.64   0.58 (0.11)  0.70 (0.31)  0.54 (0.26)  0.52 (0.18)
2009  0.81  0.41  0.65 (0.08)  0.90  0.61   0.56 (0.08)  0.98 (0.10)  1.03 (0.09)  0.41 (0.10)
2010  0.62  0.48  0.44 (0.12)  0.43  0.47   0.41 (0.06)  0.43 (0.11)  0.52 (0.11)  0.32 (0.07)
2011  0.87  0.48  0.61 (0.17)  0.77  0.60   0.53 (0.10)  0.82 (0.20)  0.93 (0.16)  0.45 (0.12)
Mean  0.82  0.41  0.48         0.58  0.49   0.44         0.58         0.61         0.33

Table 1. Median mean-absolute error with corresponding standard errors in
parentheses. Only the best testing filter- and wrapper-based results (choice of
R) are displayed. We explicitly compare GPESA with the state of the art, SL. Bold
values indicate significance (at the 0.05 level with Bonferroni correction) under
a Wilcoxon signed rank test in which the null hypothesis asserts that the
distribution of the differences between GPESA and SL is symmetric about 0.


4  Discussion 

Our  results  show  that  incorporating  dynamic  aggregations  of  higher  resolution 
variables  into  a  regional  model  is  beneficial  in  our  particular  problem  setting,  as 
compared  to  both  uniform  up-sampling  of  variables  and  a  state-of-the-art  linear 
regression  technique  (SL)  that  incorporates  individual  spatial  units.  SL  achieves 
competitive  prediction  performance  through  a  sparse  linear  combination  of  the 
individual  spatial  units,  on  par  with  SGP  which  is  not  linearly  constrained. 
Ultimately,  GPESA  performed  significantly  better  (lower  median  test  error)  than 
SL  on  a  majority  (5  of  9)  of  cross  validation  folds.  Moreover,  whenever  GPESA 
was  not  significantly  better  than  SL  it  was  not  significantly  worse. 

A main reason why GPESA has an advantage in this application is the difficulty
of knowing a priori what the most important spatial datapoints are, and how to
best aggregate them. Additionally, the structure of the model itself is unknown
and depends on the resulting aggregations. Therefore this is not a fixed-length
optimization problem, which makes it well suited to GPESA, which can search
over different numbers and non-linear combinations of spatial aggregations.
While SL can theoretically perform the same aggregation as a GPESA terminal
(mean within a radius of a geographical point), SL is restricted to a single
linear solution while GPESA is not.

However, it is important to emphasize that the computational cost of GPESA is
higher than that of traditional GP and much higher than that of linear
regression. In particular, the most expensive operation is the “on the fly”
aggregation component of GPESA, which makes the fitness evaluation require 500%
more time than in SGP. Part of the incurred cost is due to inefficiencies in our
implementation, which necessitated a data copy with every spatial aggregation
operation. In future work we will look at reducing this overhead through more
efficient data structures (e.g. k-d trees).


Importance of Spatial Data. To better understand the relevance of particular
spatial locations, we define the importance of a spatial unit for both linear
and symbolic methods, separately. For ridge regression and lasso, we can define
importance by exploiting the disposition of coefficients to be larger for
variables with a stronger correlation to the response, relative to a particular
feature set. We define linear regression importance of a particular spatial unit
as the average absolute coefficient of features that incorporate the unit into a
regression model. While we cannot as easily determine relative importance within
nonlinear models, we can instead define importance by exploiting the multiple
candidate solutions provided from stochastic multiobjective optimization. We
define GP importance of a particular spatial unit as the average absolute
correlation (1 - f_COR) of nondominated solutions that incorporate the unit.
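
The linear regression importance defined above can be sketched as a short routine. The data layout is an assumption made for illustration: each feature is given as the set of spatial unit indices it covers, paired with its fitted coefficient; the function name is hypothetical.

```python
def linear_importance(n_units, feature_units, coefficients):
    """Average absolute coefficient over the features that include each unit.

    feature_units: list of sets of unit indices, one set per constructed feature.
    coefficients:  fitted regression coefficient for each feature, same order.
    Returns a list with one value per unit (None for units no feature covers).
    """
    sums = [0.0] * n_units
    counts = [0] * n_units
    for units, coef in zip(feature_units, coefficients):
        for u in units:
            sums[u] += abs(coef)
            counts[u] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

For two features covering units {0, 1} and {1, 2} with coefficients 2.0 and -4.0, the resulting importances are 2.0, 3.0, and 4.0. Whether features with zero coefficients (as lasso produces) should count toward the average is not specified in the text; this sketch counts them.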

To visualize the importance of spatial information, we generated a series of
heatmaps (Figure 1). In Figures 1a, 1c, and 1e we show regional importance
values of filter methods for each R ∈ {1, ..., 20}, with the relevant value of R
annotated in the upper left corner of each box. Note that in lasso- and GP-based
approaches, some variables are unused (white), while ridge cannot perform
variable selection and uses all. Figures 1b and 1d plot WR and WL for
R ∈ {1, 2, 3, 4}. Finally, Figures 1e and 1f plot the importance of spatial
information in the GP sense, for FGP and GPESA, respectively. Overall, this
visualization indicates an agreement among all methods on the relatively higher
importance of information in the lower center/right region of the image.


5  Conclusion 

In this work we developed a novel method to address the problem of modeling
a regional response with high resolution satellite imagery. We moved away from
uniform up-sampling aggregations towards more flexible and interesting
aggregation operations predicated on their subsequent use as features of a
regional model. Our proposed technique, GPESA, is general and intended to apply
to a variety of modeling problems on spatially organized data. But as an
application example, and as a setting in which to evaluate our techniques, we
considered the problem of estimating snow water equivalent in high mountain Asia
using satellite imagery. Our results showed that using GP to evolve spatial
aggregations outperforms lasso, the state-of-the-art method for directly
incorporating individual spatial units into a sparse linear model.

In future work we plan to explore more flexible spatial and temporal
aggregations for more predictive modeling in real earth science applications.

Acknowledgements:  Thanks  to  Dr.  Jeff  Dozier  (UCSB)  for  posing  the  high- 
mountain  Asia  SWE  problem  and  providing  associated  datasets. 
Fig. 1. Importance (defined in Section 4) of spatial units. For filters a.) FR,
c.) FL, and e.) FGP, importance is displayed at each resolution
R ∈ {1, 2, ..., 20} and each individual filter subplot is annotated with the
corresponding R. For wrappers b.) WR and d.) WL, R ∈ {1, 2, 3, 4}. Finally,
f.) GPESA, which has no R parameter. White areas indicate spatial units unused
in feature construction across all three explanatory variables.




References 


1. J. Bongard and H. Lipson. Automated reverse engineering of nonlinear dynamical
systems. Proceedings of the National Academy of Sciences, 104(24):9943-9948,
2007.

2. D. Buckingham, C. Skalka, and J. Bongard. Inductive learning of snowpack
distribution models for improved estimation of areal snow water equivalent.
Journal of Hydrology, 524:311-325, 2015.

3. J. Dong, J. P. Walker, and P. R. Houser. Factors affecting remotely sensed snow
water equivalent uncertainty. Remote Sensing of Environment, 97(1):68-82, 2005.

4. J. Dozier, T. H. Painter, K. Rittger, and J. Frew. Time-space continuity of daily
maps of fractional snow cover and albedo from MODIS. Advances in Water
Resources, 31(11):1515-1526, 2008.

5. A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for
nonorthogonal problems. Technometrics, 12(1):55-67, 1970.

6. M. Hollander, D. A. Wolfe, and E. Chicken. Nonparametric statistical methods.
John Wiley & Sons, 2013.

7. J. R. Koza. Genetic Programming: On the Programming of Computers by Means
of Natural Selection. MIT Press, Cambridge, MA, USA, 1992.

8. K. Krawiec. Genetic programming-based construction of features for machine
learning and knowledge discovery tasks. Genetic Programming and Evolvable
Machines, 3(4):329-343, 2002.

9. J. Martinec and A. Rango. Areal distribution of snow water equivalent evaluated
by snow cover monitoring. Water Resour. Res., 17(5):1480-1488, 1981.

10. T. H. Painter, K. Rittger, C. McKenzie, P. Slaughter, R. E. Davis, and J. Dozier.
Retrieval of subpixel snow-covered area, grain size, and albedo from MODIS.
Remote Sensing of Environment, 113:868-879, 2009.

11. J. Rees, A. Gibson, M. Harrison, A. Hughes, and J. Walsby. Regional modelling of
geohazard change. Geological Society, London, Engineering Geology Special
Publications, 22(1):49-63, 2009.

12. M. Schmidt and H. Lipson. Age-fitness Pareto optimization. In R. Riolo,
T. McConaghy, and E. Vladislavleva, editors, Genetic Programming Theory and
Practice VIII, volume 8 of Genetic and Evolutionary Computation, pages 129-146.
Springer New York, 2011.

13. M. D. Schmidt, R. R. Vallabhajosyula, J. W. Jenkins, J. E. Hood, A. S. Soni, J. P.
Wikswo, and H. Lipson. Automated refinement and inference of analytical models
for metabolic networks. Physical Biology, 8(5):055011, 2011.

14. K. Stanislawska, K. Krawiec, and T. Vihma. Genetic programming for estimation
of heat flux between the atmosphere and sea ice in polar regions. In Proceedings of
the 2015 on Genetic and Evolutionary Computation Conference, pages 1279-1286.
ACM, 2015.

15. M. Tedesco and P. S. Narvekar. Assessment of the NASA AMSR-E SWE product.
Selected Topics in Applied Earth Observations and Remote Sensing, IEEE Journal
of, 3(1):141-159, 2010.

16. R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society. Series B (Methodological), pages 267-288, 1996.

17. W. R. Tobler. A computer movie simulating urban growth in the Detroit region.
Economic Geography, pages 234-240, 1970.




5  Improving  symbolic  regression. 

A large portion of the work conducted under this award involved purely theoretical progress on
improving symbolic regression. At the outset of the award, work was completed on hybridizing
symbolic and more traditional regression methods to produce one that is superior to either method
working alone [4], [5].

Most recently, we have addressed a major limitation of symbolic regression, which is its
scalability. At its heart, symbolic regression uses population-based stochastic optimization:
models are randomly modified, and if the modification reduces prediction error, the model is
retained; otherwise, it is likely to be discarded. This leads to vast computational waste, as most
models produced by these random perturbations perform worse than the originating model.

In [8], we demonstrate that the semantics of different parts of a model can be employed to
reduce the randomness of model modification and thus increase the likelihood that a change to a
model is beneficial, even for non-convex problems. This is accomplished by removing part of a
model, and looking for a complementary part from another model. Connecting these two model
parts together keeps the semantics of the resulting model close to the semantics of the contributing
models, thus increasing the likelihood of model improvement.

In [□ we demonstrate how to overcome a major challenge in the field of symbolic regression.
In order to stochastically optimize a population of models, it is imperative to maintain diversity
in the model population. This is usually done by ‘pushing’ models away from one another. This,
however, antagonizes the reduction of model error, which ‘pulls’ the models toward the desired
predictions an optimal model should make. In this particular work we show that instead of ‘pushing’
models away from one another, we can incentivize models to spread out, as much as possible, around
the desired output of the optimal models. This reduces the antagonism between low error and
model population diversity, and leads to significant improvements in the accuracy and parsimony
of the final models.

5.1  Relevance  for  U.S.  defense  and  security. 

While the theoretical advances achieved in this part of the project cannot be immediately
applied to defense and security areas of interest, it would be relatively straightforward to
incorporate these advances into the three application-specific projects outlined above or into other
applications of symbolic regression of military interest.

5.2  Icke  et  al.  “Improving  genetic  programming...”  (2013). 

A technical manuscript describing how to hybridize symbolic and linear regression methods
follows.
Abstract — Symbolic regression (SR) is a well-studied method in
genetic programming (GP) for discovering free-form mathematical
models from observed data. However, it has not been widely
accepted as a standard data science tool. The reluctance is in part
due to the hard-to-analyze random nature of GP and scalability
issues. On the other hand, most popular deterministic regression
algorithms were designed to generate linear models and therefore
lack the flexibility of GP-based SR (GP-SR). Our hypothesis
is that hybridizing these two techniques will create a synergy
between the GP-SR and deterministic approaches to machine
learning, which might help bring the GP-based techniques closer
to the realm of big learning. In this paper, we show that a hybrid
deterministic/GP-SR algorithm outperforms GP-SR alone and
the state-of-the-art deterministic regression technique alone on a
set of multivariate polynomial symbolic regression tasks as the
system to be modeled becomes more multivariate.

Index  Terms — symbolic  regression,  hybrid  algorithms,  elastic 
net,  regularization 

I.  Introduction 

Symbolic regression is one of the most popular applications of
genetic programming and an attractive alternative to standard
regression approaches due to its flexibility in generating free-form
mathematical models from observed data without any domain
knowledge. Indeed, user-friendly genetic programming
based symbolic regression (GP-SR) tools such as Eureqa [1]
have started to gain more attention from the scientific
community over the last couple of years. Despite various success
stories ([2], [3], [4]) and claims that they will one day
‘replace scientists’, GP-SR applications (or any evolutionary
computation based approach in general) have not yet been
widely accepted as standard tools for data scientists.
Although many stochastic optimization algorithms such as
stochastic gradient descent (SGD) [5] and metaheuristics such
as simulated annealing [6] are well established in mainstream
ML, evolutionary computation methods are generally
overlooked. GP suffers from various issues [7] that hinder
its applicability to many real-world data science tasks. The
theoretical foundations of GP are not as well understood
as those of many standard machine learning (ML) algorithms
due to the hard-to-analyze random nature of the technique.
Scalability is also a very challenging problem. Efforts to
increase the scalability of GP via GPUs and cloud computing have
been reported (such as in [8], [9]). It is our belief that, if GP-SR
is to be a trustable big learning [10] tool, it needs to take
advantage of the developments in general ML as well as
parallel and distributed computing techniques.

The idea of studying evolutionary computation techniques
from the standard ML perspective is not new. The behavior
of GP has been studied in terms of learning theory
in [11] and [12], amongst others. The learnable evolution model
(LEM) proposed in [13] is a technique to guide evolutionary
processes with standard ML algorithms by creating hypotheses
characterizing the differences between high performing and
low performing individuals in the population.

Recently, it has been suggested that GP might not be the
best option for SR and that stochasticity was not necessarily
a virtue. In [14], a deterministic basis function expansion
method used in conjunction with a state-of-the-art ML regression
algorithm was proposed as an alternative to GP-SR. This
algorithm, known as Fast Function Extraction (FFX),
has been reported to outperform GP-SR on a number of real-world
regression problems with dimensionality ranging from
13 to 1468. Our paper shares the same basic ideology, that is,
SR should not stray away from the well-established techniques
of ML. However, we argue that abandoning the GP approach
might not be the best way forward for SR. Instead, we propose to
hybridize the two approaches.

This  paper  explores  one  way  to  incorporate  a  deterministic 
ML  regression  technique  into  GP-SR  in  order  to  improve  GP- 
SR.  We  report  results  on  a  suite  of  synthetic  datasets  that  were 
generated  to  analyze  performance  as  the  problem  difficulty 
increases.  We  believe  that  analyzing  algorithm  performance 
in  this  manner  helps  us  understand  the  strengths/weaknesses 
of  the  approach  before  tackling  more  challenging  real-world 
problems  for  which  the  ground  truth  is  hardly  ever  available. 

The organization of this paper is as follows: sections II
and III discuss the background and related work. Our proposed
algorithm to hybridize the GP-SR and deterministic
ML approaches is detailed in section IV. Experimental results
are presented and discussed in section V. Finally, section VI
discusses conclusions and future work.
[Figure 1, image not reproduced. Panel text: feature construction may decrease
dimensionality (linear and non-linear projections of data into lower dimensional
spaces), not alter dimensionality (data normalization, standardization, signal
enhancement), increase dimensionality (non-linear expansion), or go either way
(domain-specific feature extraction operators such as edge detection in images).
Feature selection uses filters (assessed using statistical relevance tests) or
wrappers (assessed using performance of a learning algorithm).]

Fig. 1: Feature extraction as a sequential process of creating
features from the input variables and then selecting the most
informative features.


II.  Background 

Data dimensionality poses a great challenge for numerical
and symbolic regression algorithms alike. As the number
of predictors increases, it becomes more difficult to identify
the informative predictors and to build accurate models. This
problem has been well-studied in ML. The task of seeking the
best representation for a given dataset in order to optimize
the performance of a ML algorithm is known as feature
extraction [15]. Feature extraction can be seen as a sequential
process where new features are first constructed from the input
variables and then the most informative ones are selected
amongst the constructed features (Fig. 1). Feature construction
may or may not increase data dimensionality. If the input
variables are suspected to have interactions, it is generally the
practice to create additional features via non-linear expansion
(such as $x_1 \cdot x_2$).

As  for  feature  selection,  the  simplest  approach  is  to  rank 
the  features  with  respect  to  how  well  they  correlate  with  the 
predicted  variable.  This  method  has  the  risk  of  eliminating 
such  features  that  might  not  be  informative  by  themselves  but 
might  as  well  be  very  informative  together.  Subset  selection 
methods  aim  to  address  this  issue  by  considering  a  subset  of 
features  together.  These  techniques  are  divided  into  three  main 
groups:  filters,  wrappers  and  embedded  methods.  Filters  are 
pre-processing  techniques  that,  independent  from  the  learning 
algorithm,  select  a  subset  of  variables  with  respect  to  some 
criteria  such  as  mutual  information.  The  wrapper  techniques 
consider  the  learning  algorithm  as  a  black-box  and  select  the 
set  of  features  that  optimize  the  performance  of  the  learner. 
The feature subsets are generated by either forward selection,
which gradually adds features, or backward elimination, which
starts with the whole set of features and eliminates the least
informative ones. The wrapper approach is computationally
expensive as the learning algorithm needs to be executed many
times. The embedded methods incorporate feature selection
within the learning algorithm itself. Decision tree algorithms
are the earliest examples of embedded methods. More
recent embedded methods utilize the regularization technique.
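The weakness of ranking features one at a time can be illustrated with a small sketch. It is not taken from the paper: the toy data, the helper names `pearson` and `rank_features`, and the choice of Pearson correlation as the filter criterion are all assumptions made for illustration. Here $y = x_1 * x_2$, so neither variable correlates with the response on its own, while the constructed interaction feature correlates perfectly.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_features(columns, y):
    """Return feature names sorted by |correlation with y|, strongest first."""
    scores = {name: abs(pearson(col, y)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: y = x1 * x2 with x1, x2 in {-1, +1}.
columns = {
    "x1": [-1, -1, 1, 1],
    "x2": [-1, 1, -1, 1],
    "x1*x2": [1, -1, -1, 1],
}
y = [1, -1, -1, 1]
ranking = rank_features(columns, y)
```

A filter that kept only the top-ranked raw variables would discard both `x1` and `x2` here, which is the failure mode that subset selection methods try to avoid.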


Within the context of the linear regression problem, regularization
refers to imposing additional constraints on the
coefficients in order to reduce overfitting. In linear regression,
given a multivariate dataset $X_{M \times N} = \{x_1, x_2, \ldots, x_N\}$, a
matrix of observations, the response variable $Y$ is defined as:

$$Y = f(X) = \beta_0 + \sum_{j=1}^{N} \beta_j x_j$$

The coefficients are computed via the least squares estimation
by minimizing the residual sum of squares over the dataset $X$:

$$RSS = \min_{\beta} \sum_{i=1}^{M} \Big( y_i - \beta_0 - \sum_{j=1}^{N} \beta_j * x_{ij} \Big)^2$$

Since the parameters are computed on the training data,
overfitting can occur, manifesting itself as large coefficient values.
Therefore, an additional constraint is imposed
in order to tame the coefficients ($\sum_{j=1}^{N} \|\beta_j\|_1 \le t$). This
algorithm, known as the lasso (least absolute shrinkage and
selection operator), shrinks the coefficients and also performs
feature elimination since the $\ell_1$-norm promotes sparsity. Therefore,
the coefficients of uninformative features will be at or close
to 0. An $\ell_2$-norm constraint is also possible; the resulting method is called
ridge regression. Ridge regression has the effect of grouping
the correlated variables so that they are included in the model
together [16], [17]. The elastic net approach [18], [19] is a
hybrid of lasso and ridge regression and is formulated as:

$$Y = f(X) = \beta_0 + \sum_{j=1}^{N} \beta_j * x_j + \lambda_2 \|\beta\|^2 + \lambda_1 \|\beta\|_1$$

Generally, $\lambda_1$ and $\lambda_2$ are balanced by defining a single
parameter $\alpha$ ($0 \le \alpha \le 1$) that is called the mixing parameter. At
the extreme values of $\alpha$, elastic net behaves like purely lasso
or purely ridge regression. A very large value of $\lambda$ forces
all $\beta$s to be 0. As $\lambda$ is relaxed, the coefficients start to take
nonzero values. This sweep of $\lambda$ values can be visualized as
a regularization path (Fig. 2). The algorithm is named elastic
net since the “regularization path is like a stretchable net that
retains all the big fish” [18].
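The behaviour along the regularization path can be sketched with a small coordinate descent routine. This is an illustrative sketch rather than the paper's Matlab/glmnet code. It assumes the glmnet-style objective $\frac{1}{2M} RSS + \lambda [\alpha \|\beta\|_1 + \frac{1-\alpha}{2} \|\beta\|^2]$, so that $\alpha = 1$ gives the lasso and $\alpha = 0$ gives ridge regression. The function names and the four-point toy dataset are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net(X, y, lam, alpha, n_iter=100):
    """Cyclic coordinate descent for the assumed glmnet-style elastic net objective."""
    n, p = len(X), len(X[0])
    beta0, beta = sum(y) / n, [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = sq = 0.0
            for i in range(n):
                # Prediction without the contribution of feature j.
                partial = beta0 + sum(beta[k] * X[i][k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                sq += X[i][j] ** 2
            denom = sq / n + lam * (1.0 - alpha)
            beta[j] = soft_threshold(rho / n, lam * alpha) / denom if denom > 0 else 0.0
        # The intercept is not penalized: refit it as the mean residual.
        beta0 = sum(y[i] - sum(beta[k] * X[i][k] for k in range(p)) for i in range(n)) / n
    return beta0, beta


# Toy data (hypothetical): y = 2*x1 + 0.1*x2 + 1 on four orthogonal design points.
X = [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]
y = [2.0 * a + 0.1 * b + 1.0 for a, b in X]

# Sweep lambda from large to small with alpha = 1 (pure lasso).
path = {lam: elastic_net(X, y, lam, alpha=1.0)[1] for lam in (3.0, 0.5, 0.05)}
nonzero_counts = [sum(b != 0.0 for b in path[lam]) for lam in (3.0, 0.5, 0.05)]
```

In this toy sweep the strongly informative variable enters the model first and the weak one only when $\lambda$ becomes small, which is the pattern that a regularization path plot displays.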


Fig. 2: Regularization path for elastic net on a 10-dimensional
dataset. For each $\lambda$, the $\ell_1$-norm of the coefficients vector
versus individual coefficient values are shown. Each line traces
the change of coefficient values for one variable. At the
beginning, no features are selected. Gradually, more features
are added into the models as the coefficients become non-zero.
This basic linear regression algorithm applies to the generalized
linear models (GLM) of the form:

$$Y = f(X) = \beta_0 + \sum_{j=1}^{N} \beta_j b_j(X)$$

where $b_j(X)$ are nonlinear basis functions applied to the input
variables in order to construct new features.

III.  Related  Work 

GP-SR inherently performs feature selection when it finds
sufficiently accurate data models; any feature that does not
appear in the evolved expression can be considered redundant.
However, when data dimensionality is high, the search space
grows exponentially, making it difficult for GP-SR to find good
solutions. The issue of feature selection in GP-SR algorithms
has been studied by various researchers. Koza’s automatically
defined functions (ADF) [20] can be seen as a feature
extraction method within the context of SR. A Pareto-GP
based  variable  selection  scheme  was  proposed  in  [21].  In  [22], 
permutation  tests  were  introduced  in  GP-SR  to  discriminate 
between  informative  and  uninformative  variables.  In  [23], 
feature  selection  capabilities  of  GP-SR  and  random  forests 
were  compared.  The  authors  report  that  when  it  finds  an 
accurate  model,  GP-SR  captures  the  important  features  that 
are  missed  by  the  random  forests  algorithm. 

The regularization approach has been applied to GP-SR
in [24] for polynomial functional form discovery. The authors
incorporate a function smoothness term into the fitness
function as a way to decrease overfitting. The Fast Function
Extraction (FFX) algorithm reported in [14] employs a nonlinear
basis function expansion method that creates new features
via unary and binary interactions of the input variables. The
algorithm does not employ GP-SR to construct the features or
the models. The new features are created in a deterministic
manner and passed to the elastic net algorithm for model
building. The algorithm generates multiple models for the
$\lambda$s on the regularization path. The non-dominated set of
these models with respect to accuracy versus complexity is
identified as the final set of models.

Our proposed technique differs in that we perform
feature extraction using an efficient deterministic ML
algorithm and pass the features to GP-SR for model building.
By taking advantage of state-of-the-art ML, our algorithm
aims to ease the burden of feature extraction on GP-SR and
help it excel in model building.

IV.  Improving  GP-SR  using  Deterministic  ML 

The technique we propose in this paper has been largely
inspired by the FFX algorithm [14]. However, the author of [14]
proposed to eliminate GP from the symbolic regression
problem in favor of a deterministic way to augment the
dataset with polynomial features and then use a state-of-the-art
machine learning algorithm (elastic net) for model
building. In this paper, we propose to hybridize GP with
deterministic ML techniques so as to take advantage of the
strengths of both approaches to solve symbolic regression
problems more accurately and efficiently in comparison to
either technique alone. The outline of the general idea behind
FFX is presented in algorithm 1 (for a detailed description
of the FFX algorithm, see [14]). The algorithm consists of
three stages: feature construction, model building and model
selection. The feature construction stage creates new features
by applying binary nonlinear interactions (basis functions) and
augmenting the original dataset (algorithm 2). It is possible
to go beyond the binary interactions; however, this would
increase the number of constructed features exponentially. In
this paper, we considered only unary and binary features as in
[14].


Algorithm 1: The basic FFX algorithm

Input: V = {v_1, v_2, ..., v_N}
Output: The set of non-dominated evolved models based on the
validation data error versus model complexity (number of
bases) trade-off

1   [bases, expandedTrainingDataset] = basisFunctionExtraction(trainingDataset)
2   models = {}
3   foreach α ∈ (0, 0.05, 0.1, ..., 1) do
4       models = models ∪ glmnetfit(variables, trainingDataset)
5       nonDominatedModels =
6           extractParetoFrontier(models, expandedValidationDataset)
7       models = nonDominatedModels
8       models = models ∪ glmnetfit(bases, expandedTrainingDataset)
9       nonDominatedModels =
10          extractParetoFrontier(models, expandedValidationDataset)
11      models = nonDominatedModels
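The model selection step of algorithm 1 keeps only the models that are not dominated in the error versus complexity trade-off. A minimal sketch of such a filter is given below. The tuple layout `(name, n_bases, val_error)`, the function name `pareto_front` and the example values are assumptions for illustration and are not taken from the FFX implementation.

```python
def pareto_front(models):
    """Keep models not dominated in (number of bases, validation error).

    A model is dominated if another model is at least as good on both
    criteria and strictly better on at least one of them.
    """
    front = []
    for m in models:
        dominated = any(
            o[1] <= m[1] and o[2] <= m[2] and (o[1] < m[1] or o[2] < m[2])
            for o in models
        )
        if not dominated:
            front.append(m)
    return sorted(front, key=lambda m: m[1])


# Hypothetical candidates: (name, number of bases, validation error).
models = [
    ("A", 1, 0.50),
    ("B", 2, 0.20),
    ("C", 2, 0.30),
    ("D", 3, 0.25),
    ("E", 4, 0.05),
]
front = pareto_front(models)
front_names = [m[0] for m in front]
```

In this example C and D are dropped because B is no more complex and has lower validation error than either of them.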


The model building stage utilizes the coordinate descent
elastic net algorithm (glmnet, line 4 of algorithm 1) that was
proposed in [19]. As shown in Fig. 2, for each value of
$\lambda$, one can build an expression using the corresponding
coefficients. Therefore, the model building stage returns multiple
expressions containing different numbers of basis functions.


Algorithm 2: basisFunctionExtraction: Polynomial basis
function generation as new features from the observed data

Input: V = {v_1, v_2, ..., v_N}
Output: Expanded Dataset: V_e = {ve_1, ve_2, ..., ve_M}

1   // Generate unary bases
2   foreach v_i ∈ v_1, v_2, ..., v_N do
3       unaryBases = unaryBases ∪ v_i
4       foreach exp_j do
5           unaryBases = unaryBases ∪ v_i^exp_j
6       end
7       foreach unaryOperator_j do
8           unaryBases = unaryBases ∪ unaryOperator_j(v_i)
9       end
10  end
11  // Generate binary bases
12  foreach u_i ∈ unaryBases do
13      foreach u_j ∈ unaryBases do
14          binaryBases = binaryBases ∪ u_i * u_j
15      end
16  end
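A compact Python rendering of algorithm 2 is sketched below. It is an illustration under stated assumptions, not the authors' Matlab code: the dataset is assumed to be a dict of named columns, the label format (`x1^2`, `x1*x2`) is invented for readability, and only unordered pairs are generated because multiplication is commutative, whereas the pseudocode loops over all ordered pairs.

```python
import math


def expand_bases(columns, exponents=(1,), unary_ops=None):
    """Build unary bases, then all pairwise products of unary bases.

    columns: dict mapping variable name -> list of values.
    Returns (unary, binary), both dicts mapping a label -> list of values.
    """
    unary_ops = unary_ops or {}
    unary = {}
    for name, col in columns.items():
        unary[name] = list(col)
        for e in exponents:
            if e != 1:  # exponent 1 is the variable itself
                unary[f"{name}^{e}"] = [v ** e for v in col]
        for op_name, op in unary_ops.items():
            unary[f"{op_name}({name})"] = [op(v) for v in col]

    labels = list(unary)
    binary = {}
    for a in range(len(labels)):
        for b in range(a, len(labels)):  # unordered pairs, including squares
            la, lb = labels[a], labels[b]
            binary[f"{la}*{lb}"] = [u * v for u, v in zip(unary[la], unary[lb])]
    return unary, binary


data = {"x1": [2.0], "x2": [3.0]}
unary, binary = expand_bases(data)  # exponents: 1 and no unary operators
unary_sq, _ = expand_bases(data, exponents=(1, 2), unary_ops={"exp": math.exp})
```

With two input variables, a single exponent and no unary operators, this yields two unary and three binary bases. The count grows quadratically with the number of unary bases, which is why interactions beyond binary quickly become expensive.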
Note that the models are built on the training data. At
the model selection stage, the non-dominated set of models
with respect to error on validation data versus expression
complexity (the number of basis functions or bases for short)
is identified.

Our proposed method to hybridize FFX/GP-SR is presented
in algorithm 3. The process starts with a variant of FFX that
was outlined in algorithm 1. From the set of all non-dominated
models generated by FFX, all unique features (unary and
binary) are extracted (line 2). These are the features that were
found by FFX to be the most informative features for the
given regression problem. Fig. 3 summarizes the process of
identifying these features from the FFX output. Across all the
models in the non-dominated set, each basis function is extracted and
the coefficients are eliminated. The identified list of unique
unary and binary basis functions is then utilized to create
the new dataset with corresponding feature labels. The new
dataset is then passed on to GP-SR for model building.


Algorithm 3: The hybrid FFX/GP-SR algorithm

Input: V = {v_1, v_2, ..., v_N}
Output: One best model with respect to the validation data
error and complexity

1   nonDominatedModels = FFX(trainingDataset)
2   bases = extractBasisFunctions(nonDominatedModels,
3           validationDataset)
4   newDataset = createNewDataset(bases)
5   bestModel = GP-SR(newDataset)


We hypothesized that for higher dimensional problems, preprocessing
the dataset using a fast algorithm such as FFX
would  increase  the  chances  of  the  GP-SR  to  succeed  as 
opposed  to  expecting  the  GP-SR  to  perform  feature  extraction 
and  model  building  simultaneously.  The  algorithm  shown 
above  may  extract  many  basis  functions  for  high  dimensional 
datasets.  In  that  case,  further  filtering  of  the  uninformative 
features  created  by  those  basis  functions  can  be  done  before 
passing  the  features  to  the  GP-SR. 
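Lines 2 to 4 of algorithm 3 can be sketched as follows. This is a hedged illustration: a model is assumed to be a list of `(coefficient, basis)` terms in which a basis is a tuple of variable names whose product forms the feature, and the Python function names simply mirror the pseudocode rather than any published code.

```python
from math import prod


def extract_basis_functions(models):
    """Collect unique bases across all models, discarding the coefficients."""
    seen = []
    for model in models:
        for _coef, basis in model:
            if basis not in seen:
                seen.append(basis)  # keep first-seen order
    return seen


def create_new_dataset(rows, bases):
    """Evaluate every basis on every row.

    rows: list of dicts mapping variable name -> value.
    Returns (labels, new_rows) with one column per basis.
    """
    labels = ["*".join(b) for b in bases]
    new_rows = [[prod(row[v] for v in b) for b in bases] for row in rows]
    return labels, new_rows


# Hypothetical non-dominated models returned by the FFX stage.
models = [
    [(0.70, ("x3",)), (-0.23, ("x1", "x4"))],
    [(0.65, ("x3",))],
    [(0.70, ("x3",)), (-0.20, ("x1", "x4")), (0.01, ("x1", "x2"))],
]
bases = extract_basis_functions(models)
labels, new_rows = create_new_dataset(
    [{"x1": 1.0, "x2": 2.0, "x3": 3.0, "x4": 4.0}], bases
)
```

The GP-SR stage then treats each labelled column as an ordinary terminal symbol, so an interaction such as `x1*x4` is available as a single building block.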


Fig.  3:  Generation  of  the  new  dataset  based  on  FFX-generated 
expressions  (extractBasisFunctions,  line  2  of  algorithm  3). 
Most  frequent  bases  are  extracted  from  the  Pareto  frontier  and 
used  as  features  for  the  new  dataset. 


V.  Experimental  Results 

We implemented our GP-based Symbolic Regression application
using the GPTIPS Matlab package downloaded
from  [25].  Our  version  of  the  FFX  algorithm  and  FFX/GP-SR 
algorithms  were  also  implemented  in  Matlab  using  the  glmnet 
package  downloaded  from  [26]  and  the  GPTIPS  package.  All 
experiments  were  run  on  a  cluster  computing  environment. 

A.  Synthetic  Benchmark  Data  Suite  and  Evaluation  Procedure 

We tested our algorithms on a systematically generated suite
of multivariate polynomial functions in order to analyze the
performance as the difficulty of the problem is increased
in terms of the number of variables (1-3, 10), order of the
polynomial  (1-4)  and  the  number  of  basis  functions  each 
polynomial  contains  (1-4).  Examples  of  such  functions  are 
presented  in  the  following  sections.  For  each  polynomial,  2500 
data  points  were  generated  as  training  points  and  separate  sets 
of  1250  data  points  were  held  aside  as  validation  and  test  data. 
All  input  variables  were  randomly  sampled  within  the  range 
[0,1]. 
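The data generation described above can be sketched as follows. The split sizes come from the text. The seed, the helper name `make_dataset` and the use of Python's `random` module (which samples from [0, 1) rather than the closed interval) are assumptions. The target is one of the order 2, two-basis polynomials listed later among the 2-dimensional examples.

```python
import random


def make_dataset(poly, n_vars, n_points, rng):
    """Sample inputs uniformly from [0, 1) and evaluate the hidden polynomial."""
    X = [[rng.random() for _ in range(n_vars)] for _ in range(n_points)]
    y = [poly(x) for x in X]
    return X, y


def target(x):
    """Hidden ground truth: y = 0.12*x1 - 0.25*x1*x2 + 0.4."""
    return 0.12 * x[0] - 0.25 * x[0] * x[1] + 0.4


rng = random.Random(0)  # fixed seed so the sketch is reproducible
X_train, y_train = make_dataset(target, 2, 2500, rng)
X_val, y_val = make_dataset(target, 2, 1250, rng)
X_test, y_test = make_dataset(target, 2, 1250, rng)
```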

The evaluation procedure is as follows: for each type of
polynomial (such as order 2 with 2 bases), there are 30 different
datasets generated by 30 different polynomials of that
type. For each such polynomial, we perform 30 independent
GP-SR and FFX/GP-SR runs with a 1 minute runtime budget.
Since FFX is deterministic, it runs only once. For FFX, the
final set of non-dominated models is recorded. For GP-SR
and FFX/GP-SR, the best model with respect to the validation
dataset is recorded for each run. In summary, for each type
of polynomial, 900 runs of GP-SR and 900 runs of FFX/GP-SR
are performed.

Unlike the general approach, where a close approximation
with respect to the prediction error is satisfactory for
evaluating success, in this paper we also assess the
outcomes in terms of how closely the functional form of the
hidden target expression is matched. For instance, if the hidden
target expression is $a_1 * x_1 + a_2 * x_2 + p$, where $a_i$ and $p$
are real valued coefficients, we consider each evolved model
with low prediction error that matches this functional form
as a successful outcome regardless of the actual values of
the coefficients. Namely, the degree of similarity between the
hidden ground truth and the evolved polynomials is defined in
terms of syntactic similarity.

For  FFX,  the  evaluation  is  performed  based  on  the  whole 
set  of  non-dominated  models  (Fig.  4).  If  a  model  with  the 
correct  syntactic  form  exists  in  this  set,  then  the  FFX  run  is 
considered  a  success.  For  GP  and  FFX/GP,  all  30  runs  per 
unique  polynomial  are  examined.  If  a  model  with  the  correct 
syntactic form exists in this set, the algorithm is considered
successful in discovering that polynomial. We also record
the  syntactic  similarity  to  the  correct  polynomial  form.  The 
similarity  values  range  between  0  and  1;  1  meaning  a  perfect 
match  to  the  true  syntactic  form  and  0  meaning  no  match  at  all. 
For  each  unique  polynomial,  the  model  with  best  validation 
error  is  identified  and  its  syntactic  similarity  and  test  error 
values  are  recorded  as  the  outcomes  for  each  algorithm. 
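The text does not give the formula behind the syntactic similarity score, only its range and the meaning of its end points. As a purely illustrative stand-in, and not the measure used in the paper, the sketch below scores the overlap between the set of basis functions in the ground truth and in an evolved model with a Jaccard index, which also ranges from 0 (no shared bases) to 1 (identical functional form). Coefficients are ignored, in line with the evaluation described above.

```python
def syntactic_similarity(true_bases, model_bases):
    """Jaccard overlap of two collections of basis-function labels.

    This is an assumed stand-in for the paper's similarity measure.
    """
    t, m = set(true_bases), set(model_bases)
    if not t and not m:
        return 1.0  # two constant models share the same (empty) form
    return len(t & m) / len(t | m)


truth = {"x1", "x1*x2"}  # e.g. y = a1*x1 + a2*x1*x2 + p
perfect = syntactic_similarity(truth, {"x1", "x1*x2"})
partial = syntactic_similarity(truth, {"x1", "x2"})
no_match = syntactic_similarity(truth, {"x3"})
```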




[Figure 4: true model: $y = 0.1523 * x_1^2 - 0.4442 * x_1 + 0.324$]

Fig. 4: Result of an FFX run on a second order 1D problem
with two basis functions. The final model shown here is
selected from the Pareto frontier as the one with lowest
validation error.

In  the  following  sections,  we  present  the  results  separately 
for  five  sets  of  data  organized  with  respect  to  dimensionality. 
We  start  from  the  simplest  case:  one  variable  and  various 
polynomials  ranging  from  linear  with  a  single  term  to  fourth 
order with four terms. We increase the number of variables
to 2, 3, 10 and then to 25 and repeat the same set of experiments.
Finally,  statistical  significance  tests  are  utilized  in  order  to 
check  if  hybridizing  the  GP-SR  with  FFX  helps  improve  the 
performance  of  GP-SR  given  the  data  dimensionality. 

B.  Results  on  the  benchmark  problems 

Unless  otherwise  specified,  all  GP-SR  and  FFX  runs  were 
performed  using  the  following  default  parameters: 


TABLE I: Default GP-SR parameters

Parameter                  Value
Representation             GPTIPS [25] multigene syntax tree
                           Number of genes: 1
                           Maximum tree depth: 7
Population Size            500
Runtime Budget             1 minute
Selection                  Lexicographic tournament selection
Tournament Size            7
Crossover Operator         Sub-tree crossover
Crossover Probability      0.85
Mutation Operator          Sub-tree mutation
Mutation Probability       0.1
Reproduction Probability   0.05
Building Blocks            Operators: {..., protected /}
                           Terminal symbols: {x_1, ..., x_N}
Fitness                    $\frac{1}{N} \sum (y - \hat{y})^2$
Elitism                    Keep 1 best individual

1-dimensional polynomials: The following polynomials are
examples of the 1-dimensional hidden target expressions
(ground truth) used in our experiments. The polynomials are
grouped with respect to the highest order variable interaction.
Within each group, the syntactic complexity of the expressions
increases as more basis functions are included gradually. For
each expression, the number in parentheses on the left-hand
side indicates how many types of nonlinear interactions
(i.e., unary, binary, ...) are included in that expression.


TABLE II: Default FFX parameters

Parameter                  Value
Basis Function Expansion   Exponents: 1
                           Interactions: Unary, Binary
                           Operators: { }
Elastic Net                $\alpha$: {0, 0.05, 0.1, ..., 1}
                           $\lambda$: 100 $\lambda$ values calculated by glmfit based on $\alpha$
                           Maximum basis functions allowed: 250
Model Selection            Non-dominated models with respect to
                           validation data error versus number of bases

•  order 1 polynomial:

(1)  $y = 0.288 * x_1 + 0.8446$

•  order 2 polynomials:

(1)  $y = 0.14 * x_1^2 + 0.629$

(2)  $y = 0.12 * x_1 + 0.03 * x_1^2 + 0.29$

•  order 3 polynomials:

(1)  $y = -0.31 * x_1^3 - 0.11$

(2)  y  =1.35  **1-0.83*0:?  +  0.139 

(3)  $y = 0.13 * x_1 + 0.44 * x_1^2 + 0.34 * x_1^3 + 0.39$

•  order 4 polynomials:

(1)  $y = 0.20 * x_1^4 + 0.13$

(2)  y  =0.24*  xf  +  0.23  *4  +  0.39 

(3)  $y = 0.75 * x_1^2 + 0.30 * x_1^3 + 0.35 * x_1^4 + 0.334$

(4)  $y = 0.02 * x_1 + 0.13 * x_1^2 + 0.301 * x_1^3 + 0.32 * x_1^4 + 0.91$

Tables III, IV and V show the number of successful runs
for each algorithm for each type of polynomial. As the results
of the GP-SR runs indicate, even the 1-dimensional hidden
target expressions become more challenging as the number of
basis functions increases. Out of the 30 runs, the proportion
of successful discoveries of the correct functional form declines
as the syntactic complexity of the target expressions increases.

TABLE III: Standalone GP-SR runs on 1D datasets (1 minute).

                          Bases
Order of the Polynomial   1     2     3     4
1                         30    -     -     -
2                         30    29    -     -
3                         30    27    19    -
4                         30    27    11    16

TABLE IV: FFX runs on 1D datasets with unary ($x_i$) and
binary ($x_i * x_j$) interactions (average run time: 7 seconds)

                          Bases
Order of the Polynomial   1     2     3     4
1                         30    -     -     -
2                         30    30    -     -
3                         0     0     0     -
4                         0     0     0     0

TABLE V: FFX/GP-SR runs on 1D datasets (1 minute GP run
on FFX-generated dataset)

                          Bases
Order of the Polynomial   1     2     3     4
1                         30    -     -     -
2                         30    27    -     -
3                         30    26    19    -
4                         30    28    16    17
It is not surprising that FFX did not succeed at all when the
target  polynomials  were  cubic  and  fourth  order,  as  we  have 
only  allowed  for  unary  and  binary  basis  functions.  Our  goal 
was  to  test  how  much  FFX  might  help  GP-SR  discover  3rd 
and  4th  order  polynomials  utilizing  binary  bases  only.  Fig.  5 
shows  that  the  hybrid  did  not  outperform  the  plain  GP-SR 
even  on  the  quadratic  polynomials  as  the  problem  was  easy 
for  GP-SR  to  handle  within  the  given  runtime  budget. 


[Figure 5: panel titled "order: 2, bases: 2"]

Fig. 5: Summary of runs on second order 1D polynomials
with 2 basis functions. According to Wilcoxon rank sum tests,
FFX/GP-SR does not outperform GP-SR in 1 minute runtime.


2-dimensional  polynomials:  We  repeated  the  experiments 
using  a  set  of  30  polynomials  for  each  listed  form  below: 

•  order  1  polynomial: 

(1)  y  =0.62  *  X2  —  0.854 

•  order  2  polynomials: 

(1)  y  =0.22  *  x\  +  0.05 

(2)  y  =0.12  *  xi  —  0.25  *  Xi  *  X2  +  0.4 

•  order  3  polynomials: 

(1)  y  =1.67  *  x\  *  X2  +  0.46 

(2)  y  =0.17  *  xi  *  X2  +  0.369  *  xl  —  0.3 

(3)  y  =0.03  *  X2  —  0.36  *  x\  +  0.22  *  xl  4-  0.42 

•  order  4  polynomials: 

(1)  y  =2.88  *  xj  *  xl  +  0.15 

(2)  y  =0.4978  *  *1  *  xl  -  0.08  *  x\  +  0.36

(3)  y  =2.19  *  xi  *  xl  —  0.87  *  xl  +  0.87  *  x\  *  x\  +  0.39 

(4)  y  =0.13  *  X2  —  1.313  *  xi  *  X2  —  0.1  *  xl 

0.4926  *  x\  *  xl  +  0.19 


TABLE VI: Standalone GP-SR runs on 2D datasets (1 minute).

                          Bases
Order of the Polynomial   1     2     3     4
1                         30    -     -     -
2                         30    29    -     -
3                         30    22    13    -
4                         30    20    10    2

TABLE VII: FFX runs on 2D datasets with unary ($x_i$) and
binary ($x_i * x_j$) interactions (average run time: 9 seconds)

                          Bases
Order of the Polynomial   1     2     3     4
1                         30    -     -     -
2                         30    16    -     -
3                         0     0     0     -
4                         0     0     0     0

Similar to the 1-dimensional case, we found that FFX/GP-SR
did not outperform GP-SR on the 2-dimensional polynomial
datasets (Fig. 6).


TABLE VIII: FFX/GP-SR runs on 2D datasets (1 minute GP
run on FFX-generated dataset)

                          Bases
Order of the Polynomial   1     2     3     4
1                         30    -     -     -
2                         30    30    -     -
3                         30    19    14    -
4                         30    20    11    3

[Figure 6: two panels, each titled "order: 2, bases: 2"]

Fig. 6: Summary of runs on second order 2D polynomials
with 2 basis functions. According to Wilcoxon rank sum tests,
FFX/GP-SR does not outperform GP-SR in 1 minute runtime.


3-dimensional polynomials: We repeated the experiments
using a set of 30 polynomials for each listed form below:

•  order  1  polynomial: 

(1)  y  =0.746  *  x3  +  0.8268 

•  order  2  polynomials: 

(1)  y  =0.54  *  xl  +  0.4 

(2)  y  =0.8651  *  xi  -  0.61  *  x\  -  0.30 

•  order  3  polynomials: 

(1)  y  =0.84  *  xi  *  X2  *  X3  —  0.86 

(2)  y  =0.93  *  xi  *  X2  —  0.46  *  *3  +  0.88 

(3)  y  =0.04  *  X2  —  0.18  *  X2  *  *3  —  0.01  *  xi  *  x\  +  0.3 

•  order  4  polynomials: 

(1)  y  =0.20  *  xi  *  xl  +  0.91 

(2)  y  =0.73  *  x\  *  X2  —  0.07  *  x\  *  X2  *  £3  +  0.39 

(3)  y  =1.2  *  xi  *  X2  +  0.68  *  x\  *  *2+ 

0.48  *  x\  *  X2  *  X3  +  0.41 

(4)  y  =0.35  *  X3  —  0.32  *  X2  *  £3  —  0.35  *  xi*  xl  — 

0.39*4  +  0.24 


TABLE IX: Standalone GP runs on 3D datasets (1 minute)

                          Bases
Order of the Polynomial   1     2     3     4
1                         30    -     -     -
2                         30    23    -     -
3                         30    23    9     -
4                         30    13    12    3

TABLE X: FFX runs on 3D datasets with unary ($x_i$) and
binary ($x_i * x_j$) interactions (average run time: 12 seconds)

                          Bases
Order of the Polynomial   1     2     3     4
1                         30    -     -     -
2                         29    16    -     -
3                         0     0     0     -
4                         0     0     0     0
TABLE XI: FFX/GP-SR runs on 3D datasets (1 minute GP
run on FFX-generated dataset)

                          Bases
Order of the Polynomial   1     2     3     4
1                         30    -     -     -
2                         30    26    -     -
3                         30    28    14    -
4                         30    17    12    6

[Figure 7: two panels, each titled "order: 2, bases: 2"]

Fig. 7: Summary of runs on second order 3D polynomials
with 2 basis functions. According to Wilcoxon rank sum tests,
FFX/GP-SR does not outperform GP-SR in 1 minute runtime.


Second Order Polynomials from 10 & 25-dimensional
Datasets: In order to test our intuition that FFX/GP-SR
would perform better for higher dimensional data, we raised
the dimensionality of the synthetic data to 10 and 25. In this
section, we present results of GP-SR and FFX/GP-SR runs on
30 second order polynomial functions with 2 basis functions
such as the following: $y = 0.7 * x_3 - 0.23 * x_9 * x_7 + 0.2$, where
$x_i \in \{x_1, \ldots, x_{10}\}$ and $x_i \in \{x_1, \ldots, x_{25}\}$ respectively.

Since we only allowed unary and binary interactions in
our FFX implementation, FFX/GP-SR did not do significantly
better than GP-SR alone in terms of finding the correct
functional form of the hidden polynomials with orders greater
than 2. This was evident in the 1- to 3-dimensional polynomial
experiments reported in the previous section. Therefore, we only
performed runs on the second order polynomials for the 10 and
25-dimensional data. As in the previous sections, the results
reported here are aggregated over the runs on 30 different
polynomials for a runtime budget of 1 minute.

On the 10-dimensional dataset, the FFX algorithm found
the correct syntactic form (identified the correct variables and
linear form) for 10 out of the 30 polynomials. The GP-SR
algorithm by itself found 14 out of the 30 and the FFX/GP-SR
hybrid found 22 out of the 30 target polynomial forms
correctly. On the 25-dimensional dataset, the number of times
each algorithm found the correct functional form was 18, 1 and
26 out of the 30 target polynomials for FFX, GP-SR and
FFX/GP-SR respectively.

Fig. 8 and Fig. 9 summarize the comparisons of the GP-SR and
FFX/GP-SR algorithms based on the similarity to the correct
polynomial form and prediction errors. As the dimensionality
increases from 10 to 25, the performance of GP-SR
declines sharply in terms of recovering the correct functional
form within the given runtime budget of 1 minute. The
FFX/GP-SR hybrid, on the other hand, continues to succeed
as the dimensionality increases. In summary, the utility of the
hybrid algorithm becomes more significant as the number of
variables increases. The hybrid algorithm discovers expressions
that are significantly more similar to the ground truth
and significantly more predictive.

[Figure 8: two panels, each titled "order: 2, bases: 2"]

Fig. 8: Summary of runs on second order 10D polynomials
with 2 basis functions. The final expressions found by
FFX/GP-SR are significantly more similar to the ground truth
as opposed to GP-SR alone (Wilcoxon rank sum right-tailed
test, $\alpha = 0.05$, p-value: 0.0198) and more predictive (Wilcoxon
rank sum left-tailed test, $\alpha = 0.05$, p-value: 0.005)

[Figure 9: two panels, each titled "order: 2, bases: 2"]

Fig. 9: Summary of runs on second order 25D polynomials
with 2 basis functions. The final expressions found by
FFX/GP-SR are significantly more similar to the ground
truth as opposed to GP-SR alone (Wilcoxon rank sum right-tailed
test, $\alpha = 0.001$, p-value << 0.001) and more predictive
(Wilcoxon rank sum left-tailed test, $\alpha = 0.01$,
p-value << 0.01)
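The one-sided rank sum comparisons reported in the figure captions can be sketched as below. This is a minimal illustration that uses the normal approximation without a tie correction in the variance, so its p-values may differ from the exact small-sample values a statistics package would report. The function name and the toy error values are hypothetical and are not the experimental data.

```python
import math


def rank_sum_test(a, b):
    """Wilcoxon rank sum test via the normal approximation.

    Returns (w, z, p_left, p_right), where w is the rank sum of sample a.
    p_left supports "a tends to be smaller than b"; p_right the opposite.
    """
    pooled = sorted((v, idx) for idx, v in enumerate(list(a) + list(b)))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank over a block of ties
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = avg
        i = j + 1
    n1, n2 = len(a), len(b)
    w = sum(ranks[:n1])
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p_left = 0.5 * math.erfc(-z / math.sqrt(2))
    p_right = 0.5 * math.erfc(z / math.sqrt(2))
    return w, z, p_left, p_right


# Hypothetical test errors: the hybrid (a) versus plain GP-SR (b).
w, z, p_left, p_right = rank_sum_test([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

A left-tailed test is the natural choice for prediction error, where smaller is better for the hybrid, and a right-tailed test for similarity, where larger is better.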

C.  Discussion 

Even though the hybrid algorithm did not provide additional
advantage over plain GP-SR on low dimensional datasets, our
results indicated that, as the data dimensionality increased (to 10
and then to 25, in this case), the FFX/GP-SR hybrid performed
significantly better in finding more predictive expressions that
are more similar to the hidden ground truth in comparison
to GP-SR alone. By similar, we mean the degree to which
the algorithm captures the informative variables and their
nonlinear interactions.

Based on our experimental results, we note that even though
the FFX algorithm might not always find the correct functional
form for the target expressions itself, the rich set of
building blocks it provides to GP-SR has the potential
to boost the performance of GP-SR. Since the GP-SR
search space grows exponentially as the number of variables
increases, eliminating the uninformative variables beforehand
using a deterministic ML algorithm helps ease the burden
of discovering the informative variables and constructing the
useful nonlinear interactions for model building.

VI.  Conclusion 

Although GP-SR has been known for a couple of decades,
only recently have tools such as Eureqa started to attract
a larger number of scientists, due to almost zero-maintenance
and user-friendly application interfaces along with various
improvements on the metaheuristics and options for parallel
and distributed computing. However, GP-SR is yet to be
accepted as a standard data analysis tool. In this paper, we
argued that the resistance from the ML community is not
totally unfounded. First of all, theoretical underpinnings of
GP-SR such as convergence proofs are not as well established as
those of standard deterministic algorithms. GP-SR is computationally
more expensive compared to most standard ML algorithms
and, even though many intuitive strategies might be built into
the algorithm, there is no guarantee that optimal data models
will emerge at the end of the run. More importantly, despite
all the success stories, GP-SR techniques do not necessarily
outperform the state of the art in ML, especially on high
dimensional problems. On the other hand, the strength of GP-SR
is in its model-free nature, which makes it possible for the
algorithm to discover optimal and more intelligible, novel
models for the observed data. In summary, it is our belief that
stochasticity can be a virtue for SR if it is directed intelligently.

In this paper, we showed that it is possible to create
synergy between the deterministic ML and GP-SR approaches
by hybridizing them. The technique presented in this paper is
just one of many possible options to combine
GP-SR and standard ML for regression problems. Here, we
incorporated building blocks extracted by the deterministic
regression algorithm into the GP-SR algorithm by means of
re-creating the input dataset. Another option is to seed the GP-SR
runs with the candidate solutions found by the deterministic
approach. Genetic programming can also be used to evolve
features (via the generation of the basis functions) that can
be fed to the deterministic algorithm for model generation.
Our current work focuses on investigating other possible ways
to hybridize GP-SR and deterministic ML based approaches
in order to address high dimensional real-world regression
problems.
5.3  Icke et al. “Modeling hierarchy...” (2013).

A technical manuscript describing how symbolic regression can find hidden hierarchy in data follows.
Abstract: Symbolic Regression is an attractive modeling approach
because it can capture and present, mathematically,
relationships between variables of interest. However, given n
variables to model, symbolic regression returns a flat list of
n equations. As the number of state variables to be modeled
scales, interpretation of such a list becomes difficult. Here we
present a symbolic regression method that detects and captures
hidden hierarchy in a given system. The method returns the
equations in a hierarchical dependency graph, which increases
the interpretability of the results. We demonstrate that two
variations of this hierarchical modeling approach outperform
non-hierarchical symbolic regression on a synthetic data suite.

Index  Terms — hierarchy,  dependency  graph,  data  mining 

I.  Introduction 

Hierarchical  relationships  abound  in  natural  and  man-made 
systems. Hierarchy is thought to be a fundamental characteristic
of many complex systems such as biological organisms [1],
ecological systems [2], the Internet, and traffic networks [3]
and, arguably, social organizations [4]. The human visual
system is known to be organized hierarchically [5], where
the lower level components process the sensory stimuli and
the higher levels process the output of the lower level
components. Many artificial neural network architectures used
for  pattern  recognition  tasks  were  also  designed  based  on 
this  principle  [6].  More  recently,  deep  belief  networks  [7] 
attempt  to  discover  the  hierarchical  structure  hidden  in  large 
data  sets  by  learning  several  layers  of  hierarchically  organized 
features.  These  natural/artificial  hierarchical  systems  have  been 
evolved/trained  to  be  able  to  respond  to  a  wide  variety  of 
stimuli.  In  this  scheme,  each  component  is  specialized  in 
processing  a  subset  of  the  inputs  coming  from  the  lower  level. 

Our goal is to be able to automatically reverse engineer
hierarchical systems in order to understand which inputs each
component is processing and uncover the nature of the process
(i.e., how each component is computing its output from its
inputs). In this paper, we focus on systems where the inputs of
the individual components are not overlapping (Fig. 1) such as
the non-overlapping perceptron or biological neural networks
with non-overlapping receptive fields [8], [9].

Here we show that if traditional symbolic regression is
applied (in which each variable is modeled separately but
allowed to be described as a function of every other variable),
little progress is made on increasingly large yet hierarchical
target systems. This is because as the number of variables
increases, more independent runs must be performed (one
for each variable), and each run is more difficult because
there are more variables to make use of. In contrast, we
present two variations of symbolic regression adapted for
hierarchical systems: the variables that are directly influenced
by variables at the lowest level of the hierarchy are identified
and modeled first, followed by the variables at the next-highest
level, which are restricted to using the variables on the layer
below them. We find that the hierarchical approach significantly
outperforms the traditional symbolic regression paradigm on
a number of synthetic datasets that vary in difficulty.

[Figure: dependency graph with v3 = h(v1, v2), v2 = g(s3, s4), v1 = f(s1, s2)]

Fig. 1: Dependency graph for a hierarchical system with
non-overlapping inputs. The leaf nodes are the stimuli (controlled)
and internal nodes are the state variables whose behaviors are
observed. The direction of the arrows indicates dependency, which
is opposite to the information flow in the system.

This paper is organized as follows: section II presents the
problem in terms of a motivating example and section III
discusses the related work. Our proposed approaches for
automatic identification of hierarchy are presented in section IV.
Experimental results are presented in section V. Finally,
conclusions and future work are discussed in section VI.

II.  Identifying  Hierarchy 

Many  systems  are  made  up  of  multiple  components  that 
interact  with  each  other  by  receiving  inputs  from  and/or 
transmitting  outputs  to  other  components.  In  some  cases, 
these components are hierarchically organized where the
information flows in a bottom up manner. In a hierarchical
organization,  the  output  of  each  component  depends  on  the 
inputs  it  receives  from  the  components  at  the  lower  levels. 
Fig. 1 shows a very simple hierarchical system that receives
4 stimuli (s1, s2, s3, s4). The stimuli are then processed by the
two components (v1 and v2) and their outputs are passed on to
the top component v3. In this system, the stimuli are provided
by the environment (or controlled by an experimenter) and
the behavior of v1, v2 and v3 is observed. It is important to
note that only the stimuli are known and the responses of
v1, v2 and v3 to these external inputs are observed, but the
actual connectivity between the components is not known.
The goal is to identify which components are connected and
how they are connected.

In order to understand the nature of the relationships
between the inputs and outputs of each component (the functions
f,  g  and  h  in  Fig.  1),  one  would  employ  a  regression  approach. 
In  this  regard,  being  a  free-form  modeling  technique,  symbolic 
regression  can  be  seen  as  a  more  flexible  approach  as  opposed 
to  the  mainstream  linear  regression  techniques. 

The  algorithms  we  present  in  this  paper  are  built  around 
a  genetic  programming  based  implementation  of  symbolic 
regression  (SR)  in  order  to  model  the  relationships  between 
the components of a hierarchical system. In symbolic regression,
the most predictive variables will appear in the evolved
expressions  eliminating  the  non-informative  variables.  In  the 
context  of  this  paper,  if  a  variable  is  predictive  of  another,  we 
say  that  the  predicted  variable  depends  on  it.  The  nature  of  this 
dependency  can  be  further  examined  by  looking  at  the  evolved 
expression  relating  the  predicted  variable  to  its  predictors. 

III.  Related  Work 

A  genetic  programming  based  method  for  modeling  the 
ODEs  for  gene  regulatory  networks  was  presented  in  [10].  The 
algorithm  independently  evolves  expressions  to  explain  each 
state  variable  on  the  observed  time  series  data  and  does  not 
explicitly  model  hierarchy.  This  work  was  followed  by  several 
other  papers  that  produced  flat  lists  of  ODEs  when  exposed  to 
multivariate datasets [11], [12]. A linear genetic programming
based  reverse  engineering  algorithm  for  neuronal  networks 
was  presented  in  [13].  Starting  from  the  observed  data,  the 
algorithm  tries  to  infer  the  structure  and  parameters  of  the 
system.  Each  state  variable  is  evolved  with  respect  to  the 
neuroscience  domain  knowledge  that  is  built  into  the  algorithm 
which  limits  the  use  of  the  algorithm  beyond  that  specific 
problem  domain. 

The  idea  of  building  variable  interaction  networks  from 
multivariate  data  was  explored  in  several  papers.  An  algorithm 
that extracts a linear dependency structure from multivariate
data without explicitly modeling hierarchy was reported
in  [14].  The  limitation  of  the  algorithm  is  that  it  uses  linear 
regression  in  order  to  construct  a  linear  dependency  tree  or 
forest  of  the  variables.  Linear  models  are  easier  to  build  and 
they  are  intuitive,  however  there  is  no  guarantee  that  the 
phenomenon  under  study  is  governed  by  linear  relationships. 
In [15], multiple genetic programming based symbolic
regression runs are executed for each variable separately and
a variable interaction network is built by identifying the most
relevant variables for a given target variable in terms of a
measure of relative frequencies of variable appearances in the
expressions modeling that target variable. The algorithm does
not assume any specific network topology such as hierarchy.

Fig. 2: The workflow of the NSR algorithm

IV.  Symbolic  Regression  Approach  to  Model 
Hierarchy  in  Multivariate  Data 

In this section, we present three approaches to automatically
extract the hierarchical relationships in multivariate data. The
Naive SR algorithm models each state (non-stimulus) variable
separately using symbolic regression. After the run is
completed, the hierarchy is extracted by examining the expressions
and identifying which variables depend on each other. The
other two approaches are iterative and aim to enforce hierarchy
during the search for the optimal symbolic expressions.

A.  The  Naive  SR 

In  identifying  the  relationships  between  multiple  variables,  a 
straightforward approach is to model each variable separately
in  terms  of  all  other  variables  using  symbolic  regression.  In 
doing  this,  one  would  expect  that  the  best  models  would  reveal 
the  most  informative  (highly  predictive)  variables  for  each 
modeled  variable.  After  the  symbolic  regression  phase  is  done, 
constructing  the  hierarchy  is  just  a  matter  of  post-processing. 
Because  each  variable  is  modeled  independently,  the  algorithm 
does  not  impose  any  constraints  on  the  connectivity.  The 
workflow  of  the  Naive  SR  is  presented  in  Fig.  2  and  the  steps 
of  the  method  are  outlined  in  algorithm  1. 

After each non-stimulus variable is modeled separately on
the training dataset using symbolic regression (the evolve
phase), the post-processing phase begins. At this stage, the set
of all non-dominated models for each variable is evaluated on
the validation dataset. Then, the best model for each variable
is chosen (line 8 of algorithm 1) as the model with the lowest
error on the validation dataset. Ties are broken in favor
of the simplest model. Following the selection of the best
models, the variables appearing in these models are identified
as the predictors for each respective modeled variable (line
10). An adjacency matrix is then built (line 13) based on
these identified predicted variable-predictor mappings. Finally,
the algorithm returns the adjacency matrix representing the
connectivity between the variables along with the set of best
models evolved for each non-stimulus variable.
Algorithm 1: Naive SR (NSR) Algorithm
Input: V = {v1, v2, ..., vN}
Output: Dependency graph/forest G and a set of
        expressions E = {vi = fi(V - {vi})}
 1  while time budget not exceeded do
 2      foreach vi do
 3          Evolve vi = fi(V - {vi}) on training set
 4      end
 5  end
 6  E = {}, D = {}
 7  foreach vi do
 8      gi = FindBestOnParetoFront(fi) on validation set
 9      E = E U {gi}
10      di = ExtractPredictorSet(gi)
11      D = D U {di}
12  end
13  G = BuildGraph(V, D)
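
The post-processing phase of the Naive SR (lines 7 to 13) can be illustrated with a minimal Python sketch. It is not the authors' C++ implementation: the Pareto fronts below are hypothetical values for the Fig. 1 system, models are assumed to be stored as (validation error, complexity, expression string) tuples, and the helper names are chosen here for illustration only.

```python
import ast

def extract_predictors(expr):
    """Return the set of variable names appearing in an expression string."""
    tree = ast.parse(expr, mode="eval")
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}

def pick_best(front):
    """front: list of (validation_error, complexity, expression) tuples.
    The lowest validation error wins; ties go to the simplest model."""
    return min(front, key=lambda m: (m[0], m[1]))

def build_adjacency(variables, predictors):
    """adj[i][j] == 1 when variables[i] depends on variables[j]."""
    index = {name: k for k, name in enumerate(variables)}
    adj = [[0] * len(variables) for _ in variables]
    for target, preds in predictors.items():
        for p in preds:
            adj[index[target]][index[p]] = 1
    return adj

# Hypothetical non-dominated sets for the state variables of Fig. 1.
fronts = {
    "v1": [(0.40, 1, "s1"), (0.00, 3, "s1 + s2")],
    "v2": [(0.00, 3, "s3 * s4"), (0.00, 5, "s3 * s4 + s1 - s1")],
    "v3": [(0.10, 3, "v1 - v2"), (0.90, 1, "v1")],
}
variables = ["s1", "s2", "s3", "s4", "v1", "v2", "v3"]

best = {v: pick_best(front) for v, front in fronts.items()}
predictors = {v: extract_predictors(m[2]) for v, m in best.items()}
adjacency = build_adjacency(variables, predictors)
```

Because nothing constrains which variables a model may use, the same procedure would accept a model such as v1 = v3 + v2 just as readily, which is how cyclic dependencies can enter the graph.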


B. Hierarchical SR version 1 (h1)

In this method, we enforce hierarchical extraction of the
dependencies in an iterative manner. At each iteration,
dependencies for one non-stimulus variable are discovered. The
algorithm starts with only the stimuli as the set of available
independent variables. After the first iteration, the variable (vi)
that is best explained by a subset of these inputs is determined.
Then, all predictors for variable vi are removed from the
set of available independent variables in accordance with our
constraint that inputs cannot overlap. Next, vi is added to
the list of independent variables. The algorithm stops after
each (non-stimulus) variable has been modeled using symbolic
regression. The method is outlined in algorithm 2.

As  opposed  to  the  naive  algorithm,  the  HSR  version  1 
works  in  epochs.  At  the  end  of  each  epoch,  one  variable  that 
is  modeled  the  best  is  selected  and  eliminated  from  the  set 
of  dependent  variables.  This  selection  is  done  by  identifying 
one  best  model  across  all  models  for  all  variables  (line  12 
of  algorithm  2).  For  N  dependent  variables,  there  are  N  non- 
dominated  sets  at  the  end  of  each  epoch.  A  combined  non- 
dominated  set  is  then  generated  in  terms  of  validation  set  error 
and  the  expression  complexity  (Fig.  3).  Throughout  this  paper, 
expression  complexity  is  computed  as  the  number  of  nodes  in 
the  expression  tree. 
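
As a small illustration of these two definitions, the following Python sketch counts the nodes of an expression tree and builds a combined non-dominated set from pooled models. The expression strings, error values and function names are hypothetical, and Python's own parser stands in for the expression trees used by the actual implementation.

```python
import ast

def complexity(expr):
    """Number of nodes in the expression tree (operators plus leaves)."""
    tree = ast.parse(expr, mode="eval")
    kinds = (ast.BinOp, ast.UnaryOp, ast.Name, ast.Constant)
    return sum(isinstance(node, kinds) for node in ast.walk(tree))

def non_dominated(models):
    """models: list of (error, complexity, label); both objectives are minimized."""
    def dominates(a, b):
        no_worse = a[0] <= b[0] and a[1] <= b[1]
        better = a[0] < b[0] or a[1] < b[1]
        return no_worse and better
    return [m for m in models if not any(dominates(o, m) for o in models)]

# Hypothetical pooled models from the non-dominated sets of three variables.
pooled = [
    (0.0, complexity("s1 + s2"), "v1"),
    (0.0, complexity("s3 * s4"), "v2"),
    (0.0, complexity("(s1 + s2) - (s3 * s4)"), "v3"),
    (0.5, complexity("s1"), "v3"),
]
front = non_dominated(pooled)
best = min(front, key=lambda m: (m[0], m[1]))
```

In this toy case the seven-node model for v3 is dominated by the three-node models for v1 and v2, so it drops out of the combined set.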

Once  the  best  model  on  the  combined  non-dominated  set 
is  selected  (the  model  with  the  lowest  validation  data  error), 
the  corresponding  dependent  variable  and  its  predictors  are 
identified.  Since  these  predictors  can  only  be  used  to  model 
the  identified  dependent  variable  due  to  our  non-overlapping 
inputs  assumption,  they  are  removed  from  the  list  of  possible 
inputs  for  modeling  other  variables.  The  identified  dependent 
variable  is  added  to  the  list  of  possible  predictors  for  other 
variables  that  are  waiting  to  be  modeled.  This  process  actively 
enforces the extraction of hierarchical relationships and
generates a set of easily interpretable expressions instead of a flat
list  of  unstructured  expressions. 


Algorithm 2: Hierarchical SR (HSR) Algorithm v.1
Input: V = {v1, v2, ..., vN}
Output: Dependency graph G and a set of expressions
        E = {vi = fi(V - {vi})}
 1  E = {}
 2  I = {s1, s2, ..., sN}
 3  C = V, D = {}
 4  NumEpochs = #DependentVariables
 5  timeBudget = totalTimeBudget/NumEpochs
 6  while not all dependent variables are modeled do
 7      while time budget not exceeded do
 8          foreach vi do
 9              Evolve vi = fi(V - {vi}) on training set
10          end
11      end
12      gi = FindBestModel(fi) on validation set
13      E = E U {gi}
14      di = ExtractPredictorSet(gi)
15      D = D U {di}
16      C = C - {vi}
17      I = I - {di}
18      I = I U {vi}
19  end
20  G = BuildGraph(V, D)
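
The set updates performed at the end of each epoch can be traced with a short Python sketch. The epoch outcomes below are hypothetical selections for the Fig. 1 system, the evolve step is left out entirely, and the function name is an assumption made for this illustration.

```python
def h1_epoch_update(inputs, candidates, chosen, predictors):
    """Apply the end-of-epoch updates of HSR version 1.

    inputs     -- variables currently allowed as predictors (the set I)
    candidates -- dependent variables still to be modeled (the set C)
    chosen     -- dependent variable whose model was selected this epoch
    predictors -- variables used by the selected model
    """
    candidates = candidates - {chosen}
    # Non-overlapping inputs: used predictors leave the pool,
    # and the newly modeled variable becomes available instead.
    inputs = (inputs - predictors) | {chosen}
    return inputs, candidates

inputs = {"s1", "s2", "s3", "s4"}
candidates = {"v1", "v2", "v3"}
history = []

# Hypothetical selections, one per epoch.
epochs = [("v1", {"s1", "s2"}), ("v2", {"s3", "s4"}), ("v3", {"v1", "v2"})]
for chosen, preds in epochs:
    inputs, candidates = h1_epoch_update(inputs, candidates, chosen, preds)
    history.append(sorted(inputs))
```

After the second epoch only v1 and v2 remain as possible predictors, so v3 can no longer be expressed directly in terms of the stimuli.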


Reconsidering the motivating example in Fig. 1, it is easy
to see that the top-level component v3 can also be modeled as
v3 = h'(s1, s2, s3, s4). However, such an expression will be
dominated by the less complex but equally fit expressions for
v1 and v2 upon combining all non-dominated sets as in Fig. 3.


[Figure omitted; axis label: error on validation set]

Fig. 3: Selection of the best model on the combined
non-dominated set in HSR version 1

C.  Hierarchical  SR  version  2  (h2) 

The first version of the hierarchical SR algorithm pools all
evolved models in an epoch and makes the selection based on
the best model on the validation data set. In the second version
of the hierarchical algorithm, we modify this selection strategy.
Instead of identifying one best model across all variables
and then extracting the predictors, we try the opposite
approach. We first identify the best non-dominated set and
then find the set of most frequently occurring predictors across
all non-dominated models for the corresponding modeled
variable. An additional evolve step is performed in order
to find the best expression based on the identified predictor
variables only.

The reasoning behind this strategy is that making use of
the statistics about how the current set of dependent variables
are used across the Pareto front, rather than just how those
variables are used in the model with lowest error, might
improve the performance of h2 compared to h1. The method is
outlined in algorithm 3. Similar to the h1 algorithm, the h2
algorithm also runs in epochs. The only difference is in the
selection of the best model and the predictor variables at the
end of each epoch.


Algorithm 3: Hierarchical SR (HSR) Algorithm v.2
Input: V = {v1, v2, ..., vN}
Output: Dependency graph G and a set of expressions
        E = {vi = fi(V - {vi})}
 1  E = {}
 2  I = {s1, s2, ..., sN}
 3  C = V, D = {}
 4  NumEpochs = #DependentVariables
 5  timeBudget = totalTimeBudget/NumEpochs
 6  while not all dependent variables are modeled do
 7      while time budget not exceeded do
 8          foreach vi do
 9              fi = Evolve(fi(V - {vi})) on training set
10          end
11      end
12      bi = FindBestModeledVariable(vi) on validation set
13      di = ExtractPredictorSet(bi)
14      fi = Evolve(fi({di})) on training set
15      gi = FindBestModel(fi) on validation set
16      E = E U {gi}
17      D = D U {di}
18      C = C - {bi}
19      I = I - {di}
20      I = I U {bi}
21  end
22  G = BuildGraph(V, D)
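
The selection on lines 12 and 13 of algorithm 3 can be sketched in Python as follows. Several details are assumptions made only for this illustration: the fitness of a non-dominated set is taken to be the sum of error times complexity over its models, each model is stored as an (error, complexity, predictor list) tuple, and the histogram cut-off is placed at the largest drop in the sorted predictor counts, since the text does not specify how the cut-off point is chosen. All numbers are hypothetical.

```python
from collections import Counter

def set_fitness(front):
    """Sum of error * complexity over the models of one non-dominated set."""
    return sum(err * cx for err, cx, _ in front)

def select_variable(fronts):
    """Variable whose set has the lowest fitness; ties go to the
    smallest total expression complexity."""
    return min(
        fronts,
        key=lambda v: (set_fitness(fronts[v]), sum(cx for _, cx, _ in fronts[v])),
    )

def frequent_predictors(front):
    """Count each predictor once per model, then keep the predictors that
    rank above the largest drop in the sorted counts (assumed cut-off rule)."""
    counts = Counter(p for _, _, preds in front for p in set(preds))
    ranked = counts.most_common()
    if len(ranked) < 2:
        return {p for p, _ in ranked}
    drops = [ranked[k][1] - ranked[k + 1][1] for k in range(len(ranked) - 1)]
    if max(drops) == 0:  # all predictors equally frequent: keep them all
        return {p for p, _ in ranked}
    cut = drops.index(max(drops)) + 1
    return {p for p, _ in ranked[:cut]}

# Hypothetical non-dominated sets: (validation error, complexity, predictors).
fronts = {
    "v1": [
        (0.30, 1, ["s1"]),
        (0.05, 3, ["s1", "s2"]),
        (0.04, 5, ["s1", "s2", "s4"]),
        (0.03, 7, ["s1", "s2", "s3"]),
    ],
    "v3": [
        (0.90, 1, ["s1"]),
        (0.40, 3, ["s1", "s3"]),
        (0.10, 7, ["s1", "s2", "s3", "s4"]),
    ],
}
chosen = select_variable(fronts)
selected = frequent_predictors(fronts[chosen])
```

Here s1 and s2 appear in most models for v1 while s3 and s4 each appear once, so only s1 and s2 would be passed to the additional evolve step on line 14.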


The selection process is summarized in Fig. 4. For each
modeled variable, all non-dominated models are evaluated on
the validation dataset and new non-dominated sets based on
the validation dataset error versus expression complexity are
built (line 12 of algorithm 3). For each non-dominated set (ns),
the fitness is computed as the weighted sum of the error on
the validation dataset:

fitness(ns) = \sum_{i=1}^{M} error_i \times complexity_i

The non-dominated set with lowest weighted error is
selected for further processing. Ties are broken in favor of the
smallest total expression complexity. Again, reconsidering the
motivating example in Fig. 1, where v3 can also be modeled as
v3 = h'(s1, s2, s3, s4), the weighted error fitness will penalize
v3 because of the higher total complexity of the expressions.

The next step considers the models that are in the
non-dominated set for the selected variable only. A histogram
of unique occurrences of each predictor is generated and
the set of most frequently occurring predictors is selected
by identifying the cut-off point on the histogram. The final
model is then generated via symbolic regression using only
the selected predictors as the terminal symbols (lines 14-15).

[Figure omitted; axis label: error on validation set]

Fig. 4: Selection of the best variable in HSR version 2 (line
12 of algorithm 3). Selection of the best non-dominated set
(step 1), building the predictor-frequency histogram (step 2),
selecting the most frequently occurring predictors (step 3).

V. Experimental Results

In this section, we first discuss the conceptual differences
between the naive way of modeling hierarchical relationships
in multivariate data versus actively enforcing hierarchical
modeling on an example dataset. Then, we outline our
experimental procedures in generating a large benchmark synthetic
data suite with varying levels of difficulty and comparing the
three algorithms. As stated in [16], challenging benchmark
datasets are needed for genetic programming research. Ideally,
applying algorithms to real-world data is preferable. However,
especially in the case of testing new algorithms, we believe
that it is very important to have control over the data
generation process so that analysis of the strengths and weaknesses
of the algorithms can be easier. Finally, we present and discuss
our findings on these synthetic benchmark datasets.

A. An example 16-Input Binary Tree System

A simple 16-input synthetic system has been generated
using the following expressions representing the relationships
between the variables:
layer 1:
v1 = s1 + s2
v2 = s3 - s4
v3 = s5 + s6
v4 = s7 + s8
v5 = s9 + s10
v6 = s11 + s12
v7 = s13 + s14
v8 = s15 + s16

layer 2:
v9 = v1 / v2
v10 = v3 * v4
v11 = v5 [?] v6
v12 = v7 + v8

layer 3:
v13 = v9 / v10
v14 = v11 - v12

layer 4:
v15 = v14 + v13

The  resulting  dependency  graph  is  shown  in  Fig.  5.  The 
binary  tree  (arity=2)  consists  of  4  layers,  16  stimuli  and  15 
internal  nodes  which  are  the  variables  to  be  modeled.  The  total 
number  of  edges  in  the  tree  is  arity  *  internal  nodes  =  30. 
Fig. 5: True dependency graph for the synthetic 16-input
binary tree system

Fig. 6 shows the best dependency graph generated by the
naive SR algorithm in terms of the error on the test dataset.
The test error is calculated as the average error across all
modeled variables. Despite the low error, the constructed
dependency graph is very dissimilar to the original graph
shown in Fig. 5. A closer look at the constructed graph and the
generated models for the variables reveals many redundant and
cyclic dependencies as well as a number of ignored stimuli.
This example shows that even though the given dataset is
perfectly hierarchical, the NSR algorithm might fail to capture
this hierarchy. Therefore, solely minimizing the prediction
error without any constraints on the connectivity might be
deceptive for the purposes of modeling hierarchical systems.
On the other hand, when multiple runs were performed, it was
possible for the algorithm to find the correct dependency graph
structure in some of the runs along with the lowest possible
test dataset error. However, for this algorithm, a low test error
does not always mean that the hierarchy in the underlying
system is captured.
v1 = v2 * v9
v2 = v1 / v9
v3 = s6 + s5
v4 = s8 + s7
v5 = s9 + s10
v6 = s12 + s11
v7 = s14
v8 = s16 + s15
v9 = v10 * v13
v10 = v9 / v13
v11 = v14 + v12
v12 = v7 + v8
v13 = v15 - v14
v14 = v11 - v12
v15 = v14 + v13


Fig.  6:  Dependency  graph  with  lowest  test  set  error  (rmse: 
0.025)  generated  by  NSR 


Fig. 7 shows the dependency graph with the highest test set
error generated using the HSR version 1 (h1) algorithm. In
terms of the prediction accuracy, this system is worse than the
one discovered using the naive approach (Fig. 6). However, the
constructed graph almost perfectly captures the true system.
The high prediction error was caused by just one misplaced
edge (s15 is erroneously tied to [?] instead of v8). Therefore,
for the h1 and h2 algorithms, a high test set error does not always
mean failure in terms of how closely the hierarchy is captured.


Fig.  7:  Dependency  graph  with  highest  test  set  error  (rmse: 
1.662)  generated  by  HSR  version  1 
B. Synthetic Benchmark Problems

In order to study the behavior of the algorithms across
multiple datasets with varying difficulty, we generated 30
synthetic datasets for each combination of arity=2,3,4,5 and
layers=3,4,5,6. A total of 480 datasets were generated with
random stimuli as inputs to the systems and randomly
generated expressions to represent the functions of the state
variables. For the sake of simplicity, the expressions included
only the {+, -, * and protected /} operators without any constant
values. We also kept the degree of nonlinearity constant across
all datasets by enforcing that only binary nonlinear interactions
were allowed in the randomly generated expressions for the
state variables. For instance, for an arity-5 system, the hidden
expression for a state variable can be v1 = s1 + s2 - s4 * s5 - s3,
but not v1 = s1 * s2 / s4 * s5 - s3. By design, the difficulty
of the dataset increases as the data dimensionality increases
(between 4 and 3125 inputs).
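
For readers unfamiliar with the term, protected division is a division operator that returns a fixed value instead of failing when the denominator is zero. The Python sketch below uses one common genetic programming convention (returning 1); the text does not state which convention was used for the benchmark, so the function and its threshold are assumptions. The second function simply evaluates the allowed arity-5 example given above.

```python
def protected_div(a, b, eps=1e-12):
    """Division that never raises: returns 1.0 when the denominator is
    (numerically) zero. This convention is an assumption, not the paper's."""
    return a / b if abs(b) > eps else 1.0

def allowed_example(s1, s2, s3, s4, s5):
    """v1 = s1 + s2 - s4 * s5 - s3: the only nonlinear term is a product of
    two inputs, i.e. a binary nonlinear interaction."""
    return s1 + s2 - s4 * s5 - s3

ratio = protected_div(6.0, 3.0)
guarded = protected_div(1.0, 0.0)
value = allowed_example(1.0, 2.0, 3.0, 4.0, 5.0)
```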

TABLE I: Number of inputs, variables to be modeled and
number of edges for the generated synthetic benchmark trees

                              Layers
Arity   3            4              5               6
2       4, 3, 6      8, 7, 14       16, 15, 30      32, 31, 62
3       9, 4, 12     27, 13, 39     81, 40, 120     243, 121, 363
4       16, 5, 20    64, 21, 84     256, 85, 340    1024, 341, 1364
5       25, 6, 30    125, 31, 155   625, 156, 780   3125, 781, 3905
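
The entries of Table I follow from the arity and the number of layers. A short Python sketch of that arithmetic is given below; the function name is introduced here for illustration, and "layers" is counted as in Table I, that is, including the layer of stimuli.

```python
def tree_sizes(arity, layers):
    """Return (inputs, variables to be modeled, edges) for a full tree,
    with layers counted as in Table I (stimuli layer included)."""
    inputs = arity ** (layers - 1)           # leaves of the tree
    internal = (inputs - 1) // (arity - 1)   # geometric series of internal nodes
    edges = arity * internal                 # every internal node has `arity` children
    return inputs, internal, edges

binary_five = tree_sizes(2, 5)
largest = tree_sizes(5, 6)
total_columns = sum(tree_sizes(2, 5)[:2])  # inputs + modeled variables
```

For the binary tree with 5 layers, inputs plus modeled variables give 31 columns, which equals (arity^layers - 1)/(arity - 1).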

Each dataset was divided into training, validation and test
partitions as follows: for each n-arity tree system, all
expressions for the state variables (internal nodes of the tree)
were evaluated for 5000 randomly generated stimuli, creating
a 5000 x ((arity^layers - 1)/(arity - 1)) dataset. For every 4
rows, the first two rows were included in the training set, while
the third and fourth rows were included in the validation and test
sets, respectively. The training, validation and testing partitions
consisted of 2500, 1250 and 1250 rows, respectively.
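
The interleaved partitioning described above can be written as a few lines of Python. This is a sketch of the stated rule only; integer row indices stand in for the actual data rows.

```python
def split_rows(rows):
    """In every block of 4 rows: rows 1-2 go to training, row 3 to
    validation, row 4 to test."""
    train, valid, test = [], [], []
    for k, row in enumerate(rows):
        position = k % 4
        if position < 2:
            train.append(row)
        elif position == 2:
            valid.append(row)
        else:
            test.append(row)
    return train, valid, test

train, valid, test = split_rows(list(range(5000)))
```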

C.  Results  on  Benchmark  Problems 

We  ran  each  algorithm  30  times  on  all  480  datasets  for  a 
run  time  budget  of  10  minutes  per  run  for  a  total  of  43200 
runs  on  a  cluster  computing  environment.  All  algorithms 
were  implemented  in  C++.  The  baseline  symbolic  regression 
implementation  that  is  used  by  all  three  algorithms  utilizes 
the  standard  tree  based  representation  with  sub-tree  crossover 
and  mutation  operators.  Fixed  values  for  population  size  (500), 
crossover  probability  (0.9)  and  mutation  probability  (0.1)  were 
used  across  all  experiments.  Root  mean  squared  error  (rmse) 
versus  expression  complexity  trade-off  along  with  the  age 
of  the  individuals  were  used  as  the  multi-objective  fitness 
function.  Similar  to  the  AFPO  algorithm  [17],  a  new  random 
individual  is  added  to  the  population  at  the  beginning  of  each 
generation. For n state variables to be modeled, the population
included n sub-populations, each of which was set up to
evolve expressions for one state variable.

We report the results for each tree structure separately in
Fig. 8. For each arity-layer pair, the results are pooled over 30
different trees and 30 runs for each tree resulting in a total of
900 runs per algorithm. In each case, the results are presented
from three perspectives: the percentage of the edges that were
correctly discovered, the prediction error of the generated tree
on the test set, and the distribution of test set error versus the
percentage of the edges correctly discovered.
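
The first of these measures can be sketched in Python as the share of true edges that appear in the discovered graph. This definition is an assumption made for illustration (the text does not spell out how spurious edges are treated), and the discovered graph below is a hypothetical result for the Fig. 1 system, with edges written as (dependent variable, predictor) pairs.

```python
def percent_correct_edges(true_edges, found_edges):
    """Percentage of true (dependent, predictor) edges present in the result."""
    true_edges = set(true_edges)
    return 100.0 * len(true_edges & set(found_edges)) / len(true_edges)

true_edges = {
    ("v1", "s1"), ("v1", "s2"),
    ("v2", "s3"), ("v2", "s4"),
    ("v3", "v1"), ("v3", "v2"),
}
# Hypothetical result: v3 is wrongly tied to s1 instead of v1.
found_edges = {
    ("v1", "s1"), ("v1", "s2"),
    ("v2", "s3"), ("v2", "s4"),
    ("v3", "s1"), ("v3", "v2"),
}
score = percent_correct_edges(true_edges, found_edges)
```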

The statistical significance of the results is reported as
follows: for the percentage of correct edges, we compare each
pair of algorithms using the left-tailed Wilcoxon rank sum test
with Bonferroni correction and unequal variances assumption.
In those cases where the h1 and h2 algorithms are significantly
better than the naive algorithm, and when the h2 algorithm
is significantly better than the h1 algorithm, the significance
is presented using the * sign (***: α = 0.001, **: α = 0.01,
*: α = 0.05). A similar comparison is performed on the test
error results, the only difference being that the Wilcoxon rank sum
test is right-tailed, since lower test error indicates
better performance in this case.

The top row shows the results for 2-arity tree systems with
varying numbers of layers. In terms of capturing the hierarchy,
h1 and h2 almost always discover the correct connectivity
matrix and consistently outperform the naive algorithm. As far
as the test error is concerned, h1 and h2 clearly outperform
the naive algorithm only when the ratio of the total nodes to
the stimuli gets too large for the naive algorithm to deal with.
This happens when the height of the tree increases to 5. The
heatmaps show that the naive algorithm mostly finds low-error
models at the expense of missing many edges.

As the arity of the trees increases, the trees get broader and,
as a result, the number of leaves (stimuli) increases. Since this
increases the number of possible predictors, the search
becomes more and more difficult for all three algorithms.
Accordingly, the h1 and h2 algorithms start to lose their
advantage in faithfully modeling the hierarchy as the arity
increases. Another source of difficulty for all algorithms is
the increase in tree height. As the trees get taller, the number
of leaves (stimuli) also increases. Except for 2-arity systems
(top row in Fig. 8), h1 and h2 failed to complete within the
assigned run time budget in the case of the tallest trees.

D.  Discussion 

Our results on the synthetic benchmark datasets clearly
show that for binary input problems (arity 2), the hierarchical
SR algorithms consistently outperform the naive SR as the
problem difficulty increases (more tree layers). However, the
lack of scaling to much higher arity problems is due to a
number of reasons. High data dimensionality is a big challenge
for symbolic regression. In additional experiments (not shown
here), increasing the time budget from 10 minutes to 1 hour
did not significantly increase the performance of h1 and
h2. Therefore, efficient feature selection schemes for high
dimensional data within symbolic regression are definitely an
area for improvement.

Specifically, the results indicate that as the trees get broader
and taller, the initial h1 and h2 algorithms do not, in their
current form, continue to outperform the naive approach.
This is due to the fact that in these cases, the number of
epochs significantly increases, which will in turn decrease the
amount of time for each individual epoch. This increases the
risk that h1 or h2 will select the wrong predictors. Indeed,
h2 was initially devised to reduce the risk of selecting the
wrong dependent variables during an epoch. It was thought
that providing h2 with statistics about how the current set of
dependent variables are used across the Pareto front, rather
than just how those variables are used in the model with
lowest error, would improve its performance compared to
h1. However, this was not found to be the case: instead,
h1 significantly outperformed the naive algorithm on seven
of the tree structures (Fig. 8) while h2 only significantly
outperformed the naive algorithm on five of them.

Fig. 8: Comparison of the algorithms across varying data dimensionality and hierarchical organization given a 10-minute
runtime budget. Problem difficulty increases from left to right and top to bottom. For each problem type, the plots show the percentage
of the edges that were correctly discovered (top), the prediction error of the generated tree on the test set (middle), and the
distribution of test set error versus the percentage of the edges correctly discovered (bottom). The naive approach mostly fails
to recover the hierarchical network topology. The hierarchical approach outperforms the naive approach for easier problems
(small arity and/or short trees). All three algorithms perform poorly for more difficult problems (larger arity and taller trees).

We hypothesize that h2 may be further improved by incorporating
diversity-maintaining measures that increase the number
of models on the Pareto front. This should improve
the  reliability  of  the  statistics  computed  across  the  front  and 
thus  improve  the  probability  of  selecting  the  correct  set  of 
dependent  variables.  We  also  plan  to  explore  alternative  meth¬ 
ods  for  selecting  the  cutoff  point  in  the  predictor-frequency 
histogram.  Additionally,  in  h2,  predictor-frequency  histograms 
constructed  from  previous  epochs  are  discarded;  in  future  work 
we  will  investigate  ways  to  re-use  information  from  previous 
histograms  to  produce  more  accurate  histograms  in  the  current 
epoch. Also, for both h1 and h2, selecting more than one
accurately  modeled  variable  in  each  epoch  will  speed  up  the 
algorithms  by  reducing  the  number  of  epochs. 

Finally,  we  note  that  our  initial  implementations  did  not 
utilize  any  parallel  and  distributed  techniques.  Our  algorithms 
have  the  potential  to  scale  to  real-world  problems  upon  utiliz¬ 
ing  GPUs  [18]  and/or  cloud  computing  [19]. 

VI.  Conclusion 

Extracting  and  visualizing  the  relationships  in  a  hierarchical 
system  as  a  dependency  graph  improves  the  intelligibility  of 
the  overall  model,  compared  to  the  flat  list  of  equations  pro¬ 
duced  by  traditional  symbolic  regression.  Our  results  clearly 
show  that  in  order  to  find  hierarchy,  one  needs  to  explicitly 
search  for  it  rather  than  waiting  for  the  hierarchical  models  to 
emerge  in  an  unconstrained  search  such  as  in  the  naive  SR. 
Moreover,  it  was  found  that  explicitly  seeking  hierarchy  in  a 
data  set  leads  to  more  accurate  models  compared  to  traditional 
symbolic  regression.  These  algorithms  were  tested  against  a 
large  number  of  synthetic  datasets  with  increasing  difficulty 
in  terms  of  data  dimensionality  and  hierarchical  organization. 
Even though intuition suggests that the HSR version 2 (h2)
algorithm would be more robust, since it considers multiple
models when selecting the best-modeled variable,
our experimental results on 480 synthetic datasets showed no
clear advantage over HSR version 1 (h1), which makes its
selections based on the single best model at each stage.

The  focus  of  our  current  work  is  to  further  explore  more 
efficient  ways  for  the  selection  process  at  each  stage  and  to  ex¬ 
tend  the  algorithm  to  model  more  general  systems  that  exhibit 
mixtures  of  hierarchy  and  network  connectivity.  Ultimately, 
our  goal  is  to  apply  our  algorithms  to  real-world  problems  such 
as  functional  brain  connectivity  and  gene  expression  networks. 
5.4 Szubert et al. “Reducing antagonism between...” (2016).

A  technical  manuscript  describing  how  to  reduce  antagonism  between  model  diversity  and  accu¬ 
racy  follows. 
ABSTRACT

Maintaining  population  diversity  has  long  been  considered 
fundamental  to  the  effectiveness  of  evolutionary  algorithms. 
Recently,  with  the  advent  of  novelty  search,  there  has  been 
an  increasing  interest  in  sustaining  behavioral  diversity  by 
using  both  fitness  and  behavioral  novelty  as  separate  search 
objectives.  However,  since  the  novelty  objective  explicitly 
rewards  diverging  from  other  individuals,  it  can  antagonize 
the  original  fitness  objective  that  rewards  convergence  to¬ 
ward  the  solution(s).  As  a  result,  fostering  behavioral  diver¬ 
sity  may  prevent  proper  exploitation  of  the  most  interest¬ 
ing  regions  of  the  behavioral  space,  and  thus  adversely  af¬ 
fect  the  overall  search  performance.  In  this  paper,  we  argue 
that  an  antagonism  between  behavioral  diversity  and  fitness 
can  indeed  exist  in  semantic  genetic  programming  applied 
to  symbolic  regression.  Minimizing  error  draws  individuals 
toward  the  target  semantics  but  promoting  novelty,  defined 
as  a  distance  in  the  semantic  space,  scatters  them  away  from 
it. We introduce a less conflicting novelty metric, defined as
an angular distance between two program semantics with respect
to the target semantics. The experimental results show
that this metric, in contrast to the other considered
diversity-promoting objectives, consistently improves the
performance of genetic programming regardless of whether
it employs a syntactic or a semantic search operator.

Keywords 

genetic  programming;  program  semantics;  novelty  search; 
diversity;  geometric  crossover;  symbolic  regression 

1.  INTRODUCTION 

In  analogy  to  the  importance  of  genetic  diversity  in  natu¬ 
ral  evolution,  preserving  population  diversity  has  long  been 
perceived  as  being  crucial  to  the  performance  of  evolutionary 
algorithms. Intuitively, maintaining a diverse pool of candidate
solutions provides better exploration of the search space
and  thus  gives  more  opportunities  to  discover  novel,  poten¬ 
tially  fitter  individuals.  On  the  other  hand,  losing  diversity 
can  lead  to  the  well-known  problem  of  premature  conver¬ 
gence,  where  a  population  stagnates  at  local  optima  and  is 
unlikely  to  make  any  further  progress. 

A  number  of  diversity  maintenance  techniques  have  been 
proposed  to  mitigate  the  problem  of  premature  convergence 
[10,  26].  Most  of  these  methods  modify  the  selection  process 
by  promoting  the  individuals  that  are  most  different  from 
the  rest  of  the  population.  One  particular  approach  relies 
on  multiobjective  evaluation  of  individuals  with  two  objec¬ 
tives:  the  original  fitness  of  the  solution  and  some  measure  of 
its  novelty  designed  to  promote  diversity.  Although  earlier 
studies measured novelty by comparing genotypes [6], recent
work  has  successfully  employed  novelty  metrics  based  on  the 
distance  between  behaviors  [16,  17,  23]. 

However,  since  behavioral  novelty  promotes  increasing  dis¬ 
tance  between  behaviors  while  the  fitness  function  typically 
rewards  minimizing  distance  to  the  target  behavior,  we  hy¬ 
pothesize  that  in  some  cases  these  two  objectives  can  be 
overly  antagonistic  with  each  other.  Consequently,  promot¬ 
ing  diversity  can  result  in  spreading  individuals  over  the 
behavioral  space  and  slowing  down  the  convergence  of  the 
search  process.  In  other  words,  under  certain  conditions, 
employing  such  conflicting  objectives  may  result  in  exces¬ 
sive  exploration  of  the  entire  behavioral  space  and  insuffi¬ 
cient  exploitation  of  its  most  promising  regions. 

In  this  paper,  we  investigate  the  relationship  between  be¬ 
havioral  diversity  and  fitness  of  evolved  individuals  in  the 
context  of  genetic  programming  (GP),  where  behavior  of  an 
individual  can  be  identified  with  program  semantics.  In  par¬ 
ticular,  we  attempt  to  determine  whether  and  under  what 
conditions  promoting  behavioral  diversity  can  adversely  af¬ 
fect  the  search  effectiveness.  To  this  end,  we  consider  four  di¬ 
versity  promoting  objectives  and  examine  how  each  of  them, 
used  along  with  the  fitness  objective,  affects  the  performance 
of  tree-based  GP.  Moreover,  we  compare  the  fitness  of  pro¬ 
grams  evolved  with  two  types  of  search  operators:  tradi¬ 
tional  subtree-swapping  crossover  and  locally  geometric  se¬ 
mantic  crossover.  Since  fitness  landscapes  induced  by  the 
latter  are  supposedly  smoother  and  easier  to  search  with 
the  fitness  objective  alone,  we  expect  to  observe  different 
effects  of  promoting  diversity. 
The results obtained on a set of symbolic regression problems
demonstrate that some diversity objectives can indeed be
detrimental to the search performance, presumably because
they are overly antagonistic with the fitness objective.
In particular, we show that using a straightforward Euclidean
semantic novelty metric can lead to reduced performance
with respect to conventional genetic programming. By
contrast, the introduced angular semantic novelty metric,
designed to be less antagonistic with the fitness objective,
consistently improves both fitness and generalization
performance, regardless of the employed search operator.

The  remainder  of  this  paper  is  structured  as  follows.  The 
next  section  describes  the  paradigm  of  semantic  genetic  pro¬ 
gramming  and  presents  geometric  semantic  operators.  Sec¬ 
tion  3  gives  a  brief  overview  of  diversity  maintenance  meth¬ 
ods  applied  in  GP  and  introduces  the  two  aforementioned 
semantic novelty metrics. Sections 4 and 5 describe the experimental
setup and present the results. Finally, Sections 6 and
7 provide discussion and concluding remarks.

2.  SEMANTIC  GENETIC  PROGRAMMING 

Standard  tree-based  GP  searches  the  space  of  programs 
using  traditional  operators  of  subtree-swapping  crossover 
and  subtree-replacing  mutation.  These  operators  are  de¬ 
signed  to  be  generic  and  produce  syntactically  correct  off¬ 
spring  regardless  of  the  problem  domain.  However,  their 
actual  effects  on  the  behavior  of  the  program,  and  thus  its 
fitness,  are  generally  hard  to  predict.  Because  of  the  complex 
genotype-phenotype  mapping  characterized  by  low  locality, 
even  a  minimal  change  at  the  syntax  level  may  diametrically 
alter  program  semantics.  Such  large  phenotypic  changes  are 
often  considered  problematic  because,  according  to  Fisher’s 
geometric  model  [7],  the  probability  of  the  mutation  being 
beneficial  is  inversely  proportional  to  its  magnitude. 

Recently,  many  alternative  search  operators  have  been 
proposed  that  take  into  account  the  effect  of  syntactic  mod¬ 
ifications  on  program  semantics  [1,  3,  15,  22,  30].  In  order 
to  control  the  scope  of  behavioral  change,  most  of  these 
methods adopt a common definition of program semantics,
known as sampling semantics [30], which is identified with
the vector of outputs produced by a program for a sample
of possible inputs. For instance, in supervised learning,
where n input-output pairs are given as a training set
$T = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, the semantics of a program p is equal
to the vector $s(p) = [p(x_1), \ldots, p(x_n)]$, where p(x) is the result
obtained by running program p on input x. Consequently,
each program p corresponds to a point in n-dimensional semantic
space and a metric d can be adopted to measure
semantic distance between two programs. Furthermore, the fitness
of a program p can be calculated as the distance between
its semantics s(p) and the target semantics $t = [y_1, \ldots, y_n]$
defined by the training set, i.e., $f(p) = d(s(p), t)$.
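
The definitions above can be made concrete with a minimal sketch. It assumes programs are plain Python callables and uses the Euclidean distance for the metric d; the training set and the two example programs are hypothetical and chosen only for illustration.

```python
import math

def semantics(program, inputs):
    """Sampling semantics s(p): outputs of `program` on every training input."""
    return [program(x) for x in inputs]

def euclidean(u, v):
    """Euclidean metric d on the n-dimensional semantic space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def fitness(program, inputs, target):
    """f(p) = d(s(p), t); lower is better, 0 means the target is matched."""
    return euclidean(semantics(program, inputs), target)

# Hypothetical training set sampled from y = x^2 + x (not from the paper).
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
t = [x * x + x for x in xs]

exact = lambda x: x * (x + 1)   # syntactically different, same semantics as t
approx = lambda x: x            # misses the quadratic term

print(fitness(exact, xs, t))    # the two programs differ only in syntax
print(fitness(approx, xs, t))   # non-zero distance to the target semantics
```

The first program illustrates why the mapping from syntax to semantics is not injective: it is written differently from the target expression yet occupies the same point in semantic space.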

Importantly,  the  information  about  program  semantics 
can  be  exploited  not  only  at  the  level  of  search  operators 
but also for other purposes, e.g., to maintain semantic diversity
[11], to initialize the population [2] or to drive the
selection  process  [18].  All  such  semantic-aware  methods  are 
collectively  captured  by  the  umbrella  term  of  semantic  ge¬ 
netic  programming  [31].  Recently,  a  paradigm  of  behavioral 
program  synthesis  [13]  has  been  proposed,  which  extends 
semantic  GP  by  using  information  not  only  about  final  pro¬ 
gram  results  but  also  about  behavioral  characteristics  of  pro¬ 
gram  execution. 


2.1  Geometric  Semantic  Operators 

One particularly interesting class of semantic-aware search
operators is that of geometric semantic operators, introduced by
Moraglio et al. [22]. These operators not only incorporate
knowledge about program semantics but also exploit
the geometric structure of the semantic space endowed by a
metric-based fitness function. As a result, fitness landscapes
seen by these operators are smooth conic landscapes, which
are in principle easy to search. In particular, a geometric
semantic crossover under the metric d guarantees that
the semantics of each offspring $p'$ is located in the d-metric segment
connecting the semantics of its parents $p_1$ and $p_2$, i.e.,

$$d(s(p_1), s(p_2)) = d(s(p_1), s(p')) + d(s(p'), s(p_2)).$$
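
The segment condition can be checked numerically. The sketch below assumes the Euclidean metric, under which any convex combination of the parents' semantics lies on the segment; the helper names and the example vectors are hypothetical.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def convex_offspring(s1, s2, alpha):
    """Semantics alpha*s1 + (1 - alpha)*s2 with 0 <= alpha <= 1."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(s1, s2)]

def on_segment(s1, s2, s_child, tol=1e-9):
    """True if d(s1, s2) = d(s1, child) + d(child, s2) up to `tol`."""
    total = euclidean(s1, s_child) + euclidean(s_child, s2)
    return abs(euclidean(s1, s2) - total) <= tol

s_p1 = [0.0, 2.0, 4.0]          # semantics of the first parent (made up)
s_p2 = [2.0, 0.0, 0.0]          # semantics of the second parent (made up)
target = [1.0, 1.0, 1.0]

child = convex_offspring(s_p1, s_p2, 0.25)
outside = [3.0, 3.0, 3.0]       # a point that is not between the parents

print(on_segment(s_p1, s_p2, child))     # geometric offspring
print(on_segment(s_p1, s_p2, outside))   # not a geometric offspring
```

Because the Euclidean norm is convex, an offspring on the segment can never be farther from the target than the worse of its two parents, which is one way to read the "conic landscape" remark.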

Although  exact  geometric  crossover  has  been  proposed 
[22],  its  practical  applicability  is  limited  because  it  leads 
to  exponential  growth  of  the  program  size.  For  this  reason, 
alternative  operators  exist  that  employ  heuristic  methods 
to  produce  an  approximately  geometric  offspring  [14,  15]. 
Previous studies demonstrate that such approximately geometric
operators can still be effective while producing much
shorter offspring programs than exactly geometric ones.

2.2  Locally  Geometric  Crossover 

In  this  paper  we  use  Locally  Geometric  Crossover  (LGX) 
proposed  by  Krawiec  and  Pawlak  [15].  This  operator  is  ar¬ 
guably  the  easiest  to  implement  among  existing  approxi¬ 
mately  geometric  crossover  operators.  Before  applying  a 
crossover,  a  library  of  short  programs  (procedures)  must 
be  created.  Typically,  a  static  library  is  generated  by  enu¬ 
merating  all  possible  trees  lower  than  a  predefined  height. 
Alternatively,  a  dynamic  library  could  be  created  at  each 
generation  from  all  subtrees  existing  in  the  population. 

Given two parents $p_1$ and $p_2$, the operator starts by identifying
their structurally common region, i.e., the largest
region where the parent trees have the same topology. Two
crossover points are selected by drawing a pair of corresponding
nodes from the common region. Then, for the
subtrees $p'_1$ and $p'_2$ rooted at the crossover points, the semantics
of the midpoint between them (i.e., of a semantically intermediate
subprogram) is calculated as $s_m = (s(p'_1) + s(p'_2))/2$.
The library is searched for programs that are semantically
closest to $s_m$ according to the adopted metric d. From the set of k
closest programs found in the library, a random one is selected
and used to replace subtrees $p'_1$ and $p'_2$ in both parents, producing
two offspring. In the rare situation when both subtrees
$p'_1$ and $p'_2$ are semantically equivalent, a random procedure
is drawn from the library.
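
The library lookup at the heart of this operator can be sketched as follows. This is only an illustration under stated assumptions: the library is a list of (name, semantics) pairs, the metric is Euclidean, and the function name `lgx_replacement` and the tiny library are hypothetical. Identifying the common region and splicing the chosen procedure into the parent trees are omitted.

```python
import math
import random

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def midpoint(s1, s2):
    """Semantics s_m of the midpoint between two subtree semantics."""
    return [(a + b) / 2 for a, b in zip(s1, s2)]

def lgx_replacement(library, s1, s2, k, rng):
    """Choose the procedure that replaces both subtrees at the crossover points."""
    if s1 == s2:
        # Semantically equivalent subtrees: draw any procedure at random.
        return rng.choice(library)
    s_m = midpoint(s1, s2)
    ranked = sorted(library, key=lambda entry: euclidean(entry[1], s_m))
    return rng.choice(ranked[:k])   # random pick among the k closest

# Hypothetical library evaluated on the inputs [-1, 0, 1].
library = [
    ("x",   [-1.0, 0.0, 1.0]),
    ("x*x", [1.0, 0.0, 1.0]),
    ("x+x", [-2.0, 0.0, 2.0]),
    ("x-x", [0.0, 0.0, 0.0]),
]

s_sub1 = [-2.0, 0.0, 2.0]   # semantics of the subtree from parent 1
s_sub2 = [0.0, 0.0, 0.0]    # semantics of the subtree from parent 2
name, _ = lgx_replacement(library, s_sub1, s_sub2, k=1, rng=random.Random(0))
print(name)                 # with k=1 the closest procedure is always returned
```

Using k greater than one trades some geometric precision for variety in the inserted subtrees.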

3.  PROMOTING  DIVERSITY  IN  GP 

Diversity  maintenance  has  been  a  long-standing  issue  in 
GP and a number of methods have been proposed to preserve
diversity in a population [4]. Most of the early studies
in this area focus on genotypic diversity, which refers
to structural differences between programs in a population
[6, 21]. In recent years, with the advent of semantic GP,
more attention has been paid to semantic or behavioral diversity
[2, 9, 11, 19]. The notion of semantic diversity is
particularly important in GP, because the mapping between
programs and their semantics is usually a complex,
non-injective function. In particular, since many syntactically
different programs may exhibit the same behavior, genotypic
diversity does not necessarily imply semantic diversity
while the converse is often true.
Despite the assumed importance of semantic diversity [31],
there have been few empirical investigations into the effects of
promoting behavioral diversity on the effectiveness of genetic
programming. Moreover, almost all of the studied methods
are limited to ensuring that the genetic operators do not
produce offspring that are semantically equivalent to their
parents [1, 30, 11, 9]. To the best of our knowledge, the
only exception is the work of Nguyen et al. [24]. The authors
apply  both  syntactic  and  semantic  distance  metrics  in  the 
fitness  sharing  mechanism  and  demonstrate  that  only  using 
the  latter  improves  GP  performance  on  selected  symbolic 
regression  problems. 

Here, rather than fitness sharing, we adopt a multiobjective
approach that treats diversity as a separate objective. In the
following we describe the four considered variants of multiobjective
GP, which differ only with respect to the objective used
to  encourage  diversity.  In  particular,  two  of  the  objectives 
(age  and  structural  density)  have  already  proved  success¬ 
ful  in  improving  GP  performance.  Additionally,  we  propose 
two  other  objectives  which  are  essentially  behavioral  novelty 
metrics  designed  to  promote  semantic  diversity. 

3.1  Age-Fitness  Pareto  Optimization 

Age-Fitness Pareto Optimization (AFPO, [27]) is a multiobjective
method that relies on the concept of the genotypic
age of an individual, defined as the number of generations
its genetic material has been in the population [10]. The age
attribute is intended to protect young individuals from being
dominated by older, already optimized solutions. Each
randomly initialized individual starts with an age of one, which
is then incremented by one every generation. An offspring
inherits the age of the older parent.

The AFPO algorithm is based on the ParetoGP method,
which was originally proposed to address the issue of bloat
in GP [28]. The algorithm starts with a population of n randomly
initialized individuals. In each generation, it proceeds
by selecting random parents from the population and applying
crossover and mutation operators (with certain probability)
to produce n - 1 offspring. The offspring, together with
a single randomly initialized individual, are added to the
population, extending its size to 2n. Then, Pareto tournament
selection is iteratively applied by randomly selecting
a subset of individuals and removing the dominated ones
until the size of the population is reduced back to n. To
determine which individuals are dominated, the algorithm
identifies the Pareto front using two objectives (both minimized):
age and fitness (distance to the target semantics).
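
The selection step can be sketched as follows. This is a simplified illustration, not the reference AFPO implementation: individuals are plain dictionaries with hypothetical `age` and `fitness` keys, and the early-exit guard for a fully non-dominated population is an assumption added so that the loop always terminates.

```python
import random

def dominates(a, b):
    """a dominates b: no worse in both minimized objectives, better in one."""
    return (a["age"] <= b["age"] and a["fitness"] <= b["fitness"]
            and (a["age"] < b["age"] or a["fitness"] < b["fitness"]))

def offspring_age(parent_a, parent_b):
    """An offspring inherits the age of its older parent."""
    return max(parent_a["age"], parent_b["age"])

def pareto_tournament(population, target_size, tournament_size, rng):
    """Drop dominated members of random subsets until target_size is reached."""
    population = list(population)
    while len(population) > target_size:
        # Guard (assumption of this sketch): stop if nothing is dominated.
        if not any(dominates(b, a) for a in population for b in population):
            break
        subset = rng.sample(population, min(tournament_size, len(population)))
        for ind in subset:
            if len(population) > target_size and any(dominates(o, ind) for o in subset):
                population.remove(ind)
    return population

# Made-up (age, fitness) pairs; three of them are non-dominated.
pairs = [(1, 5.0), (2, 4.0), (3, 3.0), (3, 6.0), (4, 4.5), (5, 3.5)]
population = [{"age": a, "fitness": f} for a, f in pairs]
survivors = pareto_tournament(population, 3, 3, random.Random(0))
print(sorted((ind["age"], ind["fitness"]) for ind in survivors))
# → [(1, 5.0), (2, 4.0), (3, 3.0)]
```

In this example a young individual with fitness 5.0 survives next to an older one with fitness 3.0, which is the protection of young individuals that the age objective is meant to provide.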

3.2  Density-Fitness  Pareto  Optimization 

Recently,  Burks  and  Punch  [5]  proposed  an  alternative 
variant  of  the  AFPO  algorithm  called  Density-Fitness  Pareto 
Optimization  (DFPO).  This  method  relies  on  the  idea  of  a 
genetic marker, which refers to concatenated fragments of a
program  tree.  The  authors  used  markers  based  on  the  top¬ 
most  part  of  a  tree  and  calculated  structural  density  of  each 
individual  as  a  fraction  of  individuals  in  the  population  that 
share  the  same  marker.  Employing  such  a  density  measure 
as  a  minimized  objective  is  intended  to  maintain  a  specific 
form  of  structural  diversity  focused  on  the  rooted  portions 
of  the  trees.  According  to  the  reported  results  obtained  on 
three  different  problems  (including  symbolic  regression),  us¬ 
ing  density  instead  of  age  allows  DFPO  to  further  improve 
the  performance  achieved  by  AFPO. 
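
A rough sketch of the density objective follows. The exact marker construction of Burks and Punch is not reproduced here; as an assumption, trees are nested tuples of the form (label, child, ...), the marker is a preorder string of the labels in the top `levels` levels, and an individual counts itself when the fraction is computed.

```python
from collections import Counter

def marker(tree, levels):
    """Concatenate node labels found in the top `levels` levels of `tree`."""
    if levels == 0:
        return ""                      # everything below the cut is ignored
    if isinstance(tree, tuple):
        label, children = tree[0], tree[1:]
        inner = ",".join(marker(child, levels - 1) for child in children)
        return label + "(" + inner + ")"
    return tree                        # a leaf such as "x"

def structural_density(population, levels=3):
    """Fraction of the population sharing each individual's marker."""
    markers = [marker(tree, levels) for tree in population]
    counts = Counter(markers)
    return [counts[m] / len(population) for m in markers]

# Hypothetical population of three expression trees.
population = [
    ("+", ("*", "x", "x"), "x"),
    ("+", ("*", "x", ("sin", "x")), "x"),
    ("-", "x", "x"),
]

print(structural_density(population, levels=2))  # first two trees share a marker
print(structural_density(population, levels=3))  # deeper markers tell them apart
```

Minimizing this value favors individuals whose top-level structure is rare in the population.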


Figure  1:  Residual  vectors  in  two-dimensional  se¬ 
mantic  space  where  fitness  is  expressed  using  Eu¬ 
clidean  distance. 


3.3  Novelty-Fitness  Pareto  Optimization 

Inspired  by  novelty  search  [16],  we  propose  two  behavioral 
novelty  metrics  that  can  be  used  as  search  objectives.  Since 
both  objectives  refer  to  the  distribution  of  programs  in  the 
semantic  space,  maximizing  them  is  intended  to  promote 
some  form  of  behavioral  diversity.  The  bi-objective  algo¬ 
rithm  employing  fitness  and  a  behavioral  novelty  objective 
is  termed  Novelty-Fitness  Pareto  Optimization. 

Euclidean Semantic Novelty. Since in this work we
focus on real-valued symbolic regression problems, the semantic
space is n-dimensional real space. Consequently,
we can calculate the behavioral distance between programs as
the Euclidean distance between their semantics. We define the
Euclidean semantic novelty of a program p as the mean Euclidean
distance between its semantics s(p) and the semantics of
its k nearest neighbors in the semantic space:

$$\rho(p) = \frac{1}{k} \sum_{i=1}^{k} d(s(p), s(\mu_i)),$$

where k is a user-defined parameter and $\mu_i$ is the i-th nearest
program with respect to the semantic distance.
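
The Euclidean novelty score can be computed directly from the semantics of the population. The sketch below is a brute-force version under the assumption that neighbors are drawn from the current population only; the function name and the example vectors are hypothetical.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def euclidean_semantic_novelty(index, all_semantics, k):
    """Mean distance from program `index` to its k nearest neighbors."""
    s_p = all_semantics[index]
    distances = sorted(
        euclidean(s_p, s) for i, s in enumerate(all_semantics) if i != index
    )
    nearest = distances[:k]
    return sum(nearest) / len(nearest)

# Made-up semantics of four programs on two training cases.
population_semantics = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 1.0]]

crowded = euclidean_semantic_novelty(0, population_semantics, k=2)
isolated = euclidean_semantic_novelty(2, population_semantics, k=2)
print(crowded, isolated)   # the isolated program receives the higher score
```

Note that the score depends only on where programs sit relative to each other, not on where the target is, which is why maximizing it can pull programs away from the target.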

Angular Semantic Novelty. The second proposed novelty
metric focuses on angles in the semantic space (see Fig.
1). Measuring angular distance between program semantics
has been recently applied in GP [25] but not for the
purpose of maintaining diversity. For each program p, we
define the residual vector r(p) as the difference between the target
semantics and the program semantics, i.e., $r(p) = t - s(p)$. We
define the angular semantic novelty of a program p as the mean
angle between its residual vector r(p) and the residual vectors of
its k nearest neighbors with respect to the angular distance:

$$\rho(p) = \frac{1}{k} \sum_{i=1}^{k} \arccos \frac{r(p) \cdot r(\mu_i)}{\|r(p)\| \, \|r(\mu_i)\|},$$

where $\mu_i$ denotes the i-th nearest neighbor of p.
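
A brute-force sketch of the angular novelty score is given below. It assumes that every residual vector is non-zero (a program that matches the target exactly would need special handling, which is not specified here) and that neighbors come from the current population; all names and example values are hypothetical.

```python
import math

def residual(target, s):
    """Residual vector r(p) = t - s(p)."""
    return [t - a for t, a in zip(target, s)]

def angle(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding before calling acos.
    return math.acos(max(-1.0, min(1.0, dot / norms)))

def angular_semantic_novelty(index, all_semantics, target, k):
    """Mean angle between r(p) and its k angularly nearest residuals."""
    r_p = residual(target, all_semantics[index])
    angles = sorted(
        angle(r_p, residual(target, s))
        for i, s in enumerate(all_semantics) if i != index
    )
    nearest = angles[:k]
    return sum(nearest) / len(nearest)

target = [0.0, 0.0]
# Made-up semantics: programs 0 and 3 err in the same direction.
population_semantics = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [2.0, 0.0]]

print(angular_semantic_novelty(0, population_semantics, target, k=2))
```

Programs 0 and 3 have an angle of zero between their residuals even though their Euclidean distance is non-zero, so the angular score ignores how far a program is from the target and rewards only a different direction of error.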
Table 1: Symbolic regression benchmarks.

Problem      Objective function
Quartic      $4x^4 + 3x^3 + 2x^2 + x$
Nonic
R1           $(x + 1)^3 / (x^2 - x + 1)$
R2           $(x^5 - 3x^3 + 1) / (x^2 + 1)$
Keijzer-4    $x^3 e^{-x} \cos(x) \sin(x) (\sin^2(x) \cos(x) - 1)$

We  expect  that  using  this  novelty  metric  as  an  additional 
search  objective  can  be  beneficial  for  two  reasons.  First,  this 
objective  is  less  conflicting  with  fitness  than  Euclidean  se¬ 
mantic  novelty  —  a  population  of  very  fit  individuals  can  at 
the  same  time  exhibit  high  angular  semantic  diversity.  Sec¬ 
ond,  promoting  large  angles  between  residual  vectors  makes 
it  more  likely  that  the  parents  occupy  the  opposite  slopes  of 
the  fitness  landscape,  which  is  advantageous  for  geometric 
semantic  crossover.  For  instance,  consider  three  programs 
illustrated in Fig. 1. Let us assume that $p_1$ is the first parent
and we need to pick the second parent among programs $p_2$
and $p_3$, which are equally fit (have the same distance to the
target semantics). By considering the possible offspring $p_1 \times p_3$
and $p_1 \times p_2$, it can be observed that the fitness (error) of the
geometric offspring decreases as the angle between the
residual vectors of its parents increases.
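
The effect of the angle can be illustrated numerically. The sketch assumes an idealized geometric crossover that returns the exact midpoint of the parents' semantics, a two-dimensional semantic space, and parents that are all at distance 2 from the target; these values are made up. For two parents at equal distance R from the target whose residuals form an angle θ, the midpoint lies at distance R cos(θ/2).

```python
import math

def midpoint_fitness(s1, s2, target):
    """Distance to the target of the midpoint between two parent semantics."""
    mid = [(a + b) / 2 for a, b in zip(s1, s2)]
    return math.sqrt(sum((m - t) ** 2 for m, t in zip(mid, target)))

def at_angle(radius, degrees):
    """Point at the given distance and direction from the origin."""
    rad = math.radians(degrees)
    return [radius * math.cos(rad), radius * math.sin(rad)]

target = [0.0, 0.0]
first_parent = at_angle(2.0, 0)

# Second parents are equally fit but differ in the angle to the first parent.
for degrees in (30, 90, 150):
    second_parent = at_angle(2.0, degrees)
    error = midpoint_fitness(first_parent, second_parent, target)
    print(degrees, round(error, 3))
```

The printed error shrinks as the angle grows, so among equally fit candidates the one on the opposite slope of the landscape yields the better offspring.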

4.  EXPERIMENTAL  SETUP 

The main goal of the experiments is to investigate whether
and how promoting particular forms of diversity affects the
fitness of programs evolved with tree-based GP. For this purpose,
we analyze the performance of the multiobjective
diversity-promoting methods described in Section 3 and compare
them to standard GP driven by the fitness objective
alone. All the considered algorithms were implemented as
an extension¹ of the Distributed Evolutionary Algorithms in
Python (DEAP) framework [8].

4.1  Symbolic  Regression  Problems 

We  consider  five  univariate  symbolic  regression  problems 
that  are  adopted  from  previous  studies  [5,  20].  Selected 
benchmarks  (see  Table  1)  include  polynomial,  rational  and 
trigonometric functions. For each problem, fitness was calculated
as the Euclidean distance to the target semantics on 20
training cases distributed equidistantly in the [-1, 1] interval.
The only exception is Keijzer-4, for which the training
cases were sampled from the range [0, 10].
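
The training sets can be generated along the following lines. This is a sketch under assumptions: only the four objective functions that are legible in Table 1 are included (the Nonic formula is omitted), and the Keijzer-4 cases are assumed to be equidistant in [0, 10], which the text does not state explicitly.

```python
import math

def equidistant(lo, hi, n):
    """n points spread evenly over [lo, hi], endpoints included."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

BENCHMARKS = {
    "Quartic": lambda x: 4 * x**4 + 3 * x**3 + 2 * x**2 + x,
    "R1": lambda x: (x + 1) ** 3 / (x**2 - x + 1),
    "R2": lambda x: (x**5 - 3 * x**3 + 1) / (x**2 + 1),
    "Keijzer-4": lambda x: (x**3 * math.exp(-x) * math.cos(x) * math.sin(x)
                            * (math.sin(x) ** 2 * math.cos(x) - 1)),
}

def training_set(name, n_cases=20):
    """Return (inputs, target semantics) for the named benchmark."""
    if name == "Keijzer-4":
        inputs = equidistant(0.0, 10.0, n_cases)   # assumed equidistant
    else:
        inputs = equidistant(-1.0, 1.0, n_cases)
    return inputs, [BENCHMARKS[name](x) for x in inputs]

xs, t = training_set("Quartic")
print(len(xs), xs[0], round(xs[-1], 6), round(t[-1], 6))
```

The returned target vector plays the role of the target semantics t in the fitness definition of Section 2.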

4.2  Genetic  Programming  Variants 

We compare the performance of the following five variants
of tree-based GP. Four of them rely on multiobjective
fitness evaluation where one of the objectives actively promotes
some form of diversity. These setups differ only with
respect to this objective. All the other settings remain unchanged
and are summarized in Table 2.

GP.  To  observe  the  relative  impact  of  promoting  diver¬ 
sity,  as  a  baseline  method  we  use  standard  generational  tree- 
based  GP  with  single-objective  tournament  selection. 

¹ The source code necessary for reproducing our results is
available at https://github.com/mszubert/gecco_2016.


Table 2: Genetic programming settings

Parameter                  Value
population size            256
generations                1000
initialization             ramped half-and-half, height range 2-6
instruction set            {+, -, ×, /, exp, log, sin, cos}
tournament size            7
crossover probability      0.9
reproduction probability   0.1
mutation probability       0.0
node selection             90% internal nodes, 10% leaves
maximum tree height        17
maximum tree size          300
number of runs             100

AFPO.  Age-Fitness  Pareto  Optimization  algorithm  de¬ 
scribed  in  Section  3.1. 

DFPO.  Density-Fitness  Pareto  Optimization  algorithm 
(see  Section  3.2).  To  calculate  the  density  objective,  genetic 
markers were constructed using the first three levels of each tree.

ESNFPO.  Novelty-Fitness  Pareto  Optimization  (see  Sec¬ 
tion  3.3)  with  Euclidean  semantic  novelty  objective  using 
k = 15 nearest neighbors to calculate the novelty score.

ASNFPO.  Novelty-Fitness  Pareto  Optimization  (see  Sec¬ 
tion  3.3)  with  angular  semantic  novelty  objective  using  k  = 
15  nearest  neighbors  to  calculate  novelty  score. 

4.3  Search  Operators 

To gain a deeper understanding of the usefulness of diversity
under different conditions, we combine each of the considered
GP variants with the following search operators.

Standard  syntactic  crossover.  Traditional  subtree¬ 
swapping  crossover  operator  with  Koza-style  node  selection: 
0.9  probability  of  choosing  an  internal  node  [12]. 

Geometric  semantic  crossover.  Locally  geometric  se¬ 
mantic  crossover  (LGX,  see  Section  2.2)  based  on  a  static 
precomputed  library  of  procedures.  The  library  is  generated 
by  enumerating  all  possible  trees  of  height  at  most  3,  built 
from the given instruction set. When queried with a desired
semantics, the library returns a random program among the k = 8
programs with the closest semantics.

4.4  Diversity  Measures 

To  analyze  the  relationship  between  behavioral  diversity 
of  a  population  and  fitness  of  evolved  programs,  the  follow¬ 
ing  diversity  measures  were  calculated  for  each  generation. 

Median Euclidean Semantic Distance. To assess Euclidean
semantic diversity, we calculate the median of the semantic
distances between each pair of programs in the population.

Mean  Angular  Semantic  Distance.  Angular  seman¬ 
tic  diversity  is  measured  as  a  mean  angle  between  residual 
vectors  of  each  pair  of  programs  in  the  population. 
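
Both population-level measures can be sketched in a few lines. The code assumes that every residual vector is non-zero and enumerates all pairs by brute force; the example population is made up.

```python
import math
from itertools import combinations
from statistics import mean, median

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def angle(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norms)))

def median_euclidean_distance(all_semantics):
    """Median semantic distance over all pairs of programs."""
    return median([euclidean(u, v) for u, v in combinations(all_semantics, 2)])

def mean_angular_distance(all_semantics, target):
    """Mean angle between residual vectors over all pairs of programs."""
    residuals = [[t - a for t, a in zip(target, s)] for s in all_semantics]
    return mean([angle(u, v) for u, v in combinations(residuals, 2)])

target = [0.0, 0.0]
population_semantics = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(median_euclidean_distance(population_semantics))
print(mean_angular_distance(population_semantics, target))
```

For a population of 256 programs this amounts to 32,640 pairs per generation, which is cheap next to program evaluation.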

5.  RESULTS 

To analyze the relationship between diversity and performance,
we conducted 100 independent runs (with different random seeds)
of each of the 10 considered configurations (5 GP variants × 2 crossover
operators) on each of the 5 symbolic regression problems.



5.1  Search  Performance 

Figure  2  shows  the  average  best-of-generation  fitness  (cal¬ 
culated  as  a  Euclidean  distance  to  the  target)  achieved  by 
particular  methods  on  different  benchmark  problems,  with 
95%  confidence  intervals  marked  as  semi-transparent  bands. 
The  left  part  of  the  figure  illustrates  the  results  obtained 
with  traditional  subtree-swapping  crossover.  Clearly,  each  of 
the  considered  diversity  promoting  methods  significantly  im¬ 
proves  the  performance  of  the  standard  GP  algorithm.  The 
best  performance  is  achieved  either  by  DFPO  or  ASNFPO, 
depending  on  the  problem.  The  impact  of  promoting  diver¬ 
sity  on  the  fitness  of  evolved  solutions  is  much  less  clear  in 
the  case  of  LGX  crossover  (right  part  of  Figure  2).  The  only 
method  that  consistently  improves  the  results  of  the  baseline 
GP  algorithm  on  all  considered  problems  is  ASNFPO.  All 
the  other  diversity  preserving  approaches  are  detrimental  to 
the  search  performance  at  least  on  some  benchmarks. 

Further  observations  can  be  made  by  comparing  the  re¬ 
sults  achieved  by  the  same  algorithm  but  equipped  with 
different crossover operators. The largest performance improvement
is observed for the standard GP algorithm, which,
when equipped with the LGX operator, achieves significantly
higher convergence speed and final performance. As
a matter of fact, it converges so quickly that if the runs
were stopped after 100 generations, it would be the best of
the  considered  setups.  Besides  GP,  the  only  other  method 
that  regularly  benefits  from  replacing  traditional  syntactic 
crossover  with  geometric  semantic  crossover  is  ASNFPO. 
Importantly,  the  synergistic  interplay  of  LGX  crossover  and 
the  angular  semantic  novelty  objective  leads  to  the  best  over¬ 
all  results  in  terms  of  the  ultimate  achieved  fitness. 

5.2  Diversity  Analysis 

To analyze the relationship between behavioral diversity and fitness, we assessed
the diversity of populations evolved by particular methods using the measures
listed in Section 4.4. Table 3 shows Spearman correlation coefficients calculated
between behavioral diversity measured at selected generations and best fitness in
the last generation of each run.
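
The rank correlation used here can be sketched in a few lines of Python. This is a minimal illustration and not the authors' analysis code: the two data lists are invented, with one entry per run, and Spearman's coefficient is computed as the Pearson correlation of average ranks.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: diversity at some generation and final error, one value per run.
diversity_at_generation = [0.9, 0.4, 0.7, 0.2, 0.5]
final_error = [0.30, 0.05, 0.20, 0.01, 0.10]
rho = spearman(diversity_at_generation, final_error)
```

Because fitness is an error to be minimized, a positive coefficient means that more diverse populations tend to end with worse results, and a negative coefficient means the opposite.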

In the context of the Euclidean semantic distance measure (left part of Table 3),
correlation is stronger for semantic crossover than for standard syntactic
crossover. More importantly, at the end of runs the correlation is positive: large
Euclidean semantic diversity is seen with high (bad) fitness values. This
observation is consistent with the relatively weak performance achieved by the
ESNFPO method, which uses the Euclidean semantic novelty objective to promote
behavioral diversity. Taken together, these results suggest that high levels of
Euclidean semantic diversity cannot be considered generally beneficial to the
search performance.

Moreover, at the beginning of runs the correlation coefficients are much lower
(sometimes even negative) and only later start to increase. Behavioral diversity
may therefore play a different role at different stages of evolution. Indeed,
further analysis revealed that in the most successful runs, Euclidean semantic
diversity stays relatively high in the early, exploratory phase of evolution but
then gradually decreases, which corresponds to exploitation of the most promising
parts of the behavioral space. Thus, high diversity at the beginning of evolution
may be not only less harmful but even advantageous. On the other hand, keeping
diversity high throughout entire runs typically leads to inferior performance.

Figure 2: Average best fitness achieved by different variants of multiobjective GP
equipped with either standard syntactic crossover (left column) or locally
geometric semantic crossover (right column).

Table 3: Correlation between best fitness (lowest error) in the last generation and
behavioral diversity measured at selected generations as: 1) median Euclidean
semantic distance, 2) mean angular semantic distance.

Median Euclidean Semantic Distance

        Standard syntactic crossover           Geometric semantic crossover
Gen.    QUA    NON    KEI    R1     R2         QUA    NON    KEI    R1     R2
0       -.017  -.034  -.003  -.003  +.031      +.002  +.044  -.012  -.066  +.003
10      +.010  -.003  -.283  -.227  -.353      -.119  -.203  +.004  -.127  -.131
25      -.072  -.082  -.250  -.222  -.201      -.109  -.272  +.036  +.026  -.196
50      +.021  +.071  -.261  -.037  -.078      -.065  -.222  -.227  +.232  -.300
100     +.018  +.109  -.242  +.015  -.002      +.119  +.027  -.233  +.334  -.080
250     +.087  +.204  -.121  +.134  +.134      +.393  +.381  +.000  +.502  +.214
500     +.123  +.265  -.031  +.204  +.180      +.489  +.567  +.177  +.558  +.257
1000    +.175  +.296  +.020  +.243  +.229      +.630  +.694  +.251  +.567  +.380

Mean Angular Semantic Distance

        Standard syntactic crossover           Geometric semantic crossover
Gen.    QUA    NON    KEI    R1     R2         QUA    NON    KEI    R1     R2
0       +.010  -.002  +.042  -.010  +.041      -.031  +.061  -.015  -.101  +.001
10      +.153  +.173  -.427  -.009  -.220      -.464  -.533  -.115  -.499  -.421
25      -.263  -.187  -.375  -.300  -.547      -.585  -.642  -.324  -.610  -.570
50      -.392  -.358  -.477  -.448  -.609      -.651  -.675  -.607  -.680  -.624
100     -.457  -.450  -.600  -.515  -.649      -.654  -.647  -.610  -.677  -.599
250     -.519  -.441  -.639  -.533  -.650      -.630  -.560  -.599  -.607  -.578
500     -.587  -.463  -.663  -.478  -.635      -.533  -.456  -.513  -.506  -.530
1000    -.549  -.411  -.658  -.376  -.607      -.284  -.251  -.266  -.301  -.324

The second form of behavioral diversity we investigate is angular semantic
diversity. The right part of Table 3 illustrates a relatively strong negative
correlation between this diversity measure and the final fitness of evolved
programs, regardless of the type of crossover operator employed. Since high levels
of angular semantic diversity are frequently seen with low (good) fitness, we can
hypothesize that this form of diversity facilitates genetic programming. Together
with the high performance of the ASNFPO method, these results provide empirical
evidence that angular semantic diversity tends to be more useful than Euclidean
semantic diversity.

5.3  Generalization  Performance 

In order to assess the generalization performance of evolved programs, we
calculated the root-mean-square error committed by the best-of-run individuals on
1000 tests drawn uniformly from the same range as for the training set. Table 4
shows the median training error, test error and size (number of nodes) of the
individuals evolved by particular methods. To test for statistically significant
differences between the results obtained by the five compared GP variants, for each
problem and crossover operator we conducted the Kruskal-Wallis test followed by a
post-hoc analysis using pairwise Mann-Whitney tests (with sequential Bonferroni
correction). We set the level of significance at p < 0.05. Table 4 underlines the
results that were found to be significantly better than those achieved by every
other GP variant.
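
The sequential Bonferroni correction is commonly identified with Holm's step-down procedure. A minimal sketch of that procedure follows. It is an assumption that the authors used exactly this variant, and the p-values in the usage note are invented.

```python
def holm_reject(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Holm's step-down procedure: sort the p-values in ascending order, compare
    the k-th smallest (k = 0, 1, ...) against alpha / (m - k), and stop at the
    first comparison that fails; all remaining hypotheses are retained.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for k, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - k):
            reject[idx] = True
        else:
            break
    return reject
```

With five GP variants there are ten pairwise comparisons per problem and crossover operator, so the smallest p-value is compared against 0.05 / 10. As a smaller illustration, `holm_reject([0.001, 0.04, 0.03, 0.2])` rejects only the first hypothesis, because 0.03 exceeds the second threshold of 0.05 / 3.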

On most problems, the significantly lowest test error is obtained by either DFPO or
ASNFPO. Interestingly, while DFPO achieves the highest generalization performance
in the context of standard crossover, ASNFPO is the winner among methods paired
with the LGX operator. These results suggest that there is a synergy between
particular variation operators and diversity-promoting methods. Traditional
syntactic crossover is able to exploit the structural diversity maintained by DFPO,
whereas semantic crossover benefits from angular semantic diversity. Another
important observation is that ASNFPO is the only method that achieves higher
generalization performance than standard GP on all problems, regardless of the
crossover operator.

Finally, by comparing training and test errors achieved on particular benchmarks,
we can observe that the AFPO and DFPO methods overfit less than the other methods.
One explanation for the less severe overfitting is that these two methods tend to
produce shorter programs than the other methods (especially when equipped with the
LGX operator). In particular, AFPO usually produces significantly smaller trees
than any other considered method.


6.  DISCUSSION 

One of the most interesting findings from the experiments is the discrepancy
between results obtained with different crossover operators (see left vs. right
part of Fig. 2 and upper vs. lower part of Table 4). With traditional crossover,
all the considered diversity-promoting methods improve the performance of standard
GP. On the other hand, with geometric crossover, ASNFPO is the only algorithm that
consistently outperforms standard GP on all five symbolic regression problems.
These findings raise the following questions: Why is angular semantic novelty so
effective in the context of geometric crossover? Why are other diversity objectives
beneficial with one crossover operator while being detrimental with another? We
attempt to answer these questions by referring to the notion of fitness-diversity
antagonism.

For the purpose of this discussion, let us say that there is an antagonism between
fitness and diversity in a given population if improving the fitness of any single
individual is impossible without reducing population diversity. Under this
definition, angular semantic diversity is never antagonistic with fitness. Indeed,
when program semantics are moved straight in the direction of the target (along
residual vectors), angular semantic diversity does not change, while the fitness of
any solution can be arbitrarily improved. In contrast, Euclidean semantic diversity
is at least sometimes antagonistic with fitness: there are populations which cannot
be optimized without reducing their diversity. Indeed, minimizing error pulls
individuals toward the target semantics, but maximizing Euclidean diversity
scatters them away from it.
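
The invariance argument can be checked numerically. The sketch below assumes that the angular semantic distance between two programs is the angle between their residual vectors s(p) - t, as the description above suggests; the formal definition belongs to Section 4.4. The target and semantics are invented two-dimensional values.

```python
import math


def residual(semantics, target):
    """Residual vector s(p) - t of a program with respect to the target."""
    return [s - t for s, t in zip(semantics, target)]


def angle(u, v):
    """Angle between two non-zero vectors, clamped against rounding error."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))


def angular_distance(s1, s2, target):
    return angle(residual(s1, target), residual(s2, target))


def euclidean_distance(s1, s2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s1, s2)))


def move_toward_target(semantics, target, fraction):
    """Shrink the residual by `fraction` (0 < fraction < 1) along its own direction."""
    return [t + (1 - fraction) * (s - t) for s, t in zip(semantics, target)]


target = [1.0, 2.0]
s1, s2 = [5.0, 2.0], [1.0, 6.0]
s1_better = move_toward_target(s1, target, 0.5)  # error of the first program halves
s2_better = move_toward_target(s2, target, 0.5)  # error of the second program halves

angle_before = angular_distance(s1, s2, target)
angle_after = angular_distance(s1_better, s2_better, target)
spread_before = euclidean_distance(s1, s2)
spread_after = euclidean_distance(s1_better, s2_better)
```

In this example both programs halve their error, the angle between their residuals is unchanged, and the Euclidean distance between them shrinks, which is the antagonism described above.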

Intuitively, one could expect that antagonistic diversity objectives would be
detrimental to the search performance. However, this may not be the case in
deceptive fitness landscapes, where the local fitness gradient is misleading. In
such a situation, increasing the semantic distance to the target (fitness) can in
fact reduce the distance to the target measured in the search space seen by
specific operators. We hypothesize that semantically blind standard crossover
induces a relatively rugged and deceptive landscape. This hypothesis would to some
extent explain why any type of diversity objective, regardless of its antagonism,
improves the search performance in the context of this crossover operator.

On the other hand, according to Moraglio et al. [22], geometric semantic operators
see cone landscapes, which are easy to search by the fitness objective alone as
they are not deceptive at all. Even though we employ approximately geometric
crossover, we expect that the corresponding fitness landscape is still much
smoother than the one induced by traditional crossover. In such landscapes
fitness-diversity antagonism is much more likely to be detrimental. This would
explain the weak performance achieved by using the antagonistic Euclidean semantic
novelty objective. Since angular semantic novelty, by contrast, is the only
diversity objective known to be non-antagonistic, it proves successful in the
context of the geometric crossover operator.

Table 4: Median training error, test error and size of best-of-run individuals. For
each problem and crossover operator the best results are shown in bold. Underline
indicates statistically significant superiority.

Finally, let us discuss two other reasons that could explain the aforementioned
discrepancy in results. First, by analyzing how the fitness of perfectly geometric
offspring depends on the angular distance between its parents (cf. Fig. 1), we
expect that the geometric crossover operator is able to effectively exploit angular
semantic diversity. Another synergistic combination involves structural diversity
(promoted by the DFPO algorithm) and the traditional syntactic crossover operator.
Both combinations of diversity objective and search operator result in superior
performance when compared to the other considered methods. Second, the reason why
diversity maintenance plays such an important (and beneficial) role in GP equipped
with traditional crossover is that in our experiments we do not employ any mutation
operator which could supply new genetic material and explicitly sustain genetic
diversity in a population. In the absence of mutation, we expect that standard GP
with subtree-swapping crossover is particularly vulnerable to the problem of
premature convergence. This problem is less severe with locally geometric crossover
because it relies on a large library of procedures, which provides the population
with new subtrees and thus acts as a simple diversity-preserving mechanism.

7.  CONCLUSIONS 

In recent years, the issue of behavioral diversity and its impact on the
performance of evolutionary algorithms has been studied in many different contexts
[17, 23, 29]. To the best of our knowledge, this is the first study that
investigates the role of behavioral diversity in genetic programming equipped with
semantic search operators. The main goal of this work was to determine whether and
under what conditions promoting behavioral diversity can adversely affect the
performance of GP applied to symbolic regression.

The most important finding is that using an additional diversity-promoting
objective can indeed be detrimental to the search performance. However, such a
situation was observed only when both of the following conditions were met. First,
a specific search operator was employed, which supposedly induced a smooth,
non-deceptive fitness landscape. Second, the behavioral diversity objective was
inherently antagonistic with the fitness objective. On the other hand, by
introducing a non-antagonistic angular semantic novelty objective, we were able to
improve the results regardless of the employed search operator. Importantly, this
objective was the only one that proved successful in the context of the locally
geometric crossover operator.

The major limitation of this study is that our experimental investigations were
conducted using a small set of five univariate symbolic regression benchmarks.
Although promoting angular semantic diversity proved useful in this context,
further work is needed to verify whether these results extend to more complex
real-world problems. In particular, it would be interesting to analyze how strongly
the dimensionality of both the feature space and the semantic space affects the
performance of particular diversity-promoting methods. Another direction of future
research would be to investigate the importance of behavioral diversity for other
semantic search operators.

In a broader perspective, our investigation indicates that a diversity objective
needs to be carefully chosen with respect to the problem at hand and the employed
search algorithm. As demonstrated by this study, using objectives that are
antagonistic with fitness was detrimental to the performance of semantic GP.
However, we expect that with increasing deceptiveness in the fitness landscape, the
consequences of using antagonistic objectives become more difficult to predict. In
particular, one could hypothesize that in highly deceptive fitness landscapes
antagonistic diversity objectives are more likely to be beneficial. This would be
consistent with previous studies demonstrating that in extremely deceptive cases a
successful way to increase search effectiveness is to ignore fitness and use a
novelty objective alone [16].

5.5 Szubert et al. “Semantic forward propagation...” (2016).

A technical manuscript describing how to de-randomize model perturbations to
improve optimization follows.

Abstract. In recent years, a number of methods have been proposed that attempt to
improve the performance of genetic programming by exploiting information about
program semantics. One of the most important developments in this area is semantic
backpropagation. The key idea of this method is to decompose a program into two
parts, a subprogram and a context, and calculate the desired semantics of the
subprogram that would make the entire program correct, assuming that the context
remains unchanged. In this paper we introduce Forward Propagation Mutation, a novel
operator that relies on the opposite assumption: instead of preserving the context,
it retains the subprogram and attempts to place it in the semantically right
context. We empirically compare the performance of semantic backpropagation and
forward propagation operators on a set of symbolic regression benchmarks. The
experimental results demonstrate that semantic forward propagation produces smaller
programs that achieve significantly higher generalization performance.

Keywords: genetic programming, program semantics, semantic backpropagation,
problem decomposition, symbolic regression


1  Introduction 

Standard tree-based genetic programming (GP) searches the space of programs using
the traditional operators of subtree-swapping crossover and subtree-replacing
mutation [4]. These operators are designed to be generic and produce syntactically
correct offspring regardless of the problem domain. However, their actual effects
on the behavior of the program, and thus its fitness, are generally hard to
predict. For this reason, many alternative search operators have recently been
proposed that take into account the influence of syntactic modifications on program
semantics [1,11,10,13].

Semantic backpropagation [12,15] is arguably one of the most powerful techniques
employed by such semantic-aware GP operators. The two operators based on semantic
backpropagation, Random Desired Operator (RDO) and Approximately Geometric
Crossover (AGX), have proved to be successful on a number of symbolic regression
and boolean program synthesis problems [11,12]. Both operators rely on semantic
decomposition of an existing program into two parts: a subprogram and its context.
Given a subprogram, both operators attempt to calculate its desired semantics,
i.e., the values that it should return to make the entire program produce the
desired output, assuming that the context remains unchanged. The desired semantics
can then be used to find a replacement for the subprogram that improves the overall
program behavior.

Despite their superior performance when compared to other GP search operators
[15,12,11], backpropagation-based RDO and AGX face a few major challenges that can
limit their practical applicability. First of all, they are much more
computationally expensive than traditional syntactic operators. Indeed, in order to
calculate desired semantics, the target program output needs to be backpropagated
by traversing the tree and inverting the execution of particular instructions. The
computational cost of this operation is similar to the cost of a single fitness
evaluation (which is typically the most expensive component of GP). Moreover, using
desired semantics to find a subprogram replacement usually requires even more
computational effort. Finally, the results reported so far demonstrate that RDO and
AGX tend to produce relatively large programs that are difficult to interpret and
may suffer from overfitting.

In this paper, we introduce Forward Propagation Mutation (FPM), a novel
semantic-aware operator that also relies on program decomposition but works in the
opposite manner to semantic backpropagation. Instead of preserving the context and
replacing the subprogram, forward propagation retains the subprogram and attempts
to place it in the semantically right context. In contrast to semantic
backpropagation, the FPM operator does not require an additional tree traversal and
thus incurs less computational overhead. Moreover, the experimental results
obtained on a set of univariate and bivariate symbolic regression problems
demonstrate that it achieves competitive performance in terms of the training error
while producing much smaller programs that usually perform significantly better on
unseen test cases.

2  Semantic  Genetic  Programming 

In order to incorporate semantic awareness into genetic programming, most of the
recently proposed methods adopt a common definition of program semantics, known as
sampling semantics [13], which is identified with the vector of outputs produced by
a program for a sample of possible inputs. In the supervised learning problems
considered here, where $n$ input-output pairs are given as a training set
$T = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, the semantics of a program $p$ is equal
to the vector $s(p) = [p(x_1), \ldots, p(x_n)]$, where $p(x)$ is the result
obtained by running program $p$ on input $x$. Consequently, each program $p$
corresponds to a point in the $n$-dimensional semantic space, and a metric $d$ can
be adopted to measure the semantic distance between two programs. Furthermore, the
fitness of a program $p$ can be calculated as the distance between its semantics
$s(p)$ and the target semantics $t = [y_1, \ldots, y_n]$ defined by the training
set, i.e., $f(p) = d(s(p), t)$.
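
These definitions translate directly into code. The following sketch is illustrative only: it represents a program as a plain Python callable and takes the Euclidean distance as the metric $d$, and the training set and candidate program are invented.

```python
import math


def semantics(program, inputs):
    """Sampling semantics s(p): the outputs of `program` on every training input."""
    return [program(x) for x in inputs]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def fitness(program, training_set, metric=euclidean):
    """f(p) = d(s(p), t), where t collects the desired outputs of the training set."""
    inputs = [x for x, _ in training_set]
    target = [y for _, y in training_set]
    return metric(semantics(program, inputs), target)


# Hypothetical training set sampled from x**2 + 1, and a candidate that omits the +1.
training_set = [(0.0, 1.0), (1.0, 2.0), (2.0, 5.0)]


def candidate(x):
    return x * x


candidate_semantics = semantics(candidate, [x for x, _ in training_set])
candidate_fitness = fitness(candidate, training_set)
```

Here the candidate misses every target value by exactly one, so its fitness is the length of the all-ones vector in three dimensions.
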

The information about program semantics and the structure of the semantic space
endowed by a metric-based fitness function can be exploited in many ways to
facilitate the search process carried out by GP. Apart from numerous semantic
search operators [1,13,10,11], the knowledge about semantics can be used to
maintain population diversity [3], to initialize the population [2] or to drive the
selection process [7]. All such semantic-aware methods are collectively captured by
the umbrella term of semantic genetic programming [14]. Recently, a paradigm of
behavioral program synthesis [5] has been proposed, which extends semantic GP by
using information not only about final program results but also about behavioral
characteristics of program execution.

3  Semantic  Backpropagation 

One of the most important methods in semantic GP is semantic backpropagation [12].
The key concept behind this method is program decomposition: a program $p$ is
treated as a function (i.e., it is deterministic and has no side effects) that can
be decomposed into two constituent functions (subprograms) $p_1$ and $p_2$ such
that $p(x) = p_2(p_1(x), x)$. In particular, if a program is represented as a tree,
such a decomposition can be made at each node: the inner function $p_1$ is
expressed by the subtree rooted at the given node, while the outer function $p_2$
corresponds to the rest of the tree (also termed the context [9]; see the left part
of Fig. 1).

Semantic backpropagation assumes that the desired program output $p^*(x)$ can be
produced by retaining the outer function and replacing just the inner one by
another subprogram $p_s$, i.e., $p^*(x) = p_2(p_s(x), x)$. Starting from the
desired program output $p^*(x)$, the backpropagation algorithm heuristically
inverts the program execution to calculate the desired semantics of the subprogram
$p_s$, i.e., the values it should produce to make the entire program correct. This
idea has been employed to design two operators, AGX and RDO, which differ with
respect to what they use as the desired program output $p^*(x)$. In this study, we
focus on RDO, a mutation operator that assumes that the target semantics
$t = [y_1, \ldots, y_n]$ is given a priori and thus the values $p^*(x_i) = y_i$ can
be used as an input for the backpropagation algorithm.

An example of a mutation performed by RDO is illustrated in Fig. 1 and proceeds as
follows. First, a random mutation node is selected in the parent program (denoted
as a circle with a double border in Fig. 1). The subtree $p_1$ rooted at this node
is removed from the tree and the backpropagation algorithm is applied to calculate
the desired semantics of the replacement $p_s$ that would make the offspring
program return the desired values. The algorithm starts from the root of the tree,
where the desired semantics is given by $t$, and follows the path to the removed
subtree. For each node it calculates the desired semantics of its child by invoking
the Invert function (a detailed description of this function and the RDO operator
in general can be found in [12]).

Fig. 1. A mutation performed by Random Desired Operator using semantic
backpropagation. Desired semantics are denoted in italics.

For instance, let us assume that a training set contains just two cases with inputs
$x = [1, 2]$ and desired outputs $t = [0, 2]$. As shown in Fig. 1, in the first
step the algorithm finds out that to produce the desired semantics at the root,
knowing that the outputs of its right child are equal to $[1, 1]$, the desired
semantics of the left child must be equal to $[1, 3]$. This result is used in the
subsequent step to calculate the desired semantics for the next node. Finally,
given the desired semantics at the mutation node, the RDO operator attempts to
replace the removed subtree with a subprogram that would produce such values. To
this end, it employs a precomputed library of programs (procedures) that allows
efficient retrieval of a program $p^*$ that has the smallest semantic distance to
the desired semantics. Additionally, RDO also checks if a single constant real
value would provide a better match to the desired semantics than $p^*$.
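
The first backpropagation step of this example can be reproduced in a few lines of Python. The sketch assumes that the root node is a subtraction, which is consistent with the numbers quoted above but is not stated explicitly in the text. The function name is illustrative and does not reproduce the Invert function of [12].

```python
def invert_subtraction_left(desired_output, right_child_output):
    """Desired values of the left operand of `left - right`, one per training case.

    If left - right must equal the desired output, then left = desired + right.
    """
    return [o + r for o, r in zip(desired_output, right_child_output)]


t = [0, 2]                  # desired semantics at the root
right_child = [1, 1]        # actual outputs of the right child
desired_left = invert_subtraction_left(t, right_child)
```

Under this assumption `desired_left` equals `[1, 3]`, matching the value given in the text, and the same kind of step is then repeated for the next node on the path to the mutation node.
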

Importantly,  in  the  process  of  semantic  backpropagation,  inverting  certain 
functions  can  be  ambiguous  (if  the  function  is  not  injective)  or  impossible  (if 
the  function  is  not  surjective).  As  a  result,  the  desired  semantics  may  contain 
several  values  for  each  training  case  or  special  inconsistent  elements.  The  library 
must  be  able  to  handle  such  queries  efficiently  [12,15]. 
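
Both situations can be illustrated with single-case inversions. In this sketch a set of candidate values stands for an ambiguous result and `None` for an inconsistent one. This representation is an assumption made for illustration and is not the data structure used in [12,15].

```python
import math


def invert_square(desired):
    """Invert y = x**2 for one training case."""
    if desired < 0:
        return None                    # no real x has a negative square
    root = math.sqrt(desired)
    return {root, -root}               # two candidates, or one when desired == 0


def invert_exp(desired):
    """Invert y = exp(x) for one training case."""
    if desired <= 0:
        return None                    # exp never returns a non-positive value
    return {math.log(desired)}
```

A library query built from such results has to accept several admissible values per training case and to tolerate cases for which no value is admissible.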

4  Semantic  Forward  Propagation 

Inspired by semantic backpropagation and RDO, we propose an alternative mutation
operator based on the complementary idea, which we term semantic forward
propagation. Similarly to RDO, Forward Propagation Mutation (FPM) relies on the
decomposability of a program $p$ into a subtree $p_1$ and a context $p_2$. However,
while RDO assumes that the context can be preserved and attempts to replace the
subtree, FPM makes the opposite assumption, preserving the subtree and building a
matching context for it.

The FPM operator starts by choosing a random mutation node in the parent program.
The subtree $p_1$ rooted at this node is extracted from the tree and used as a
starting point for creating an offspring. In order to build a new context for this
subtree, we assume a fixed structure of the context $p_c$ containing 4 new nodes
and a matching library procedure (see Fig. 2). We apply an exhaustive search to
identify a context $p_c^*$ of the assumed structure that minimizes the fitness of
the entire offspring program, $p_c^* = \arg\min_{p_c} f(p_c \circ p_1)$. To this
end, we consider all pairwise combinations of the available unary (e.g.,
$\{\sin, \cos, \log, \exp\}$) and binary functions (e.g., $\{\times, +, -, /\}$)
that could be placed directly above the selected subtree, as nodes $u$ and $b$,
respectively (cf. Fig. 2). Importantly, we extend the unary function set with the
identity function $\mathrm{id}(x) = x$. If the best found context $p_c^*$ uses this
function, we skip adding the node $u$ to the tree. For each pair of functions
$(u, b)$ placed above the subtree $p_1$, we forward propagate the semantics of the
subtree up to the root of the new tree. Then, we apply just a single
backpropagation step, using the same Invert function as in RDO, to calculate the
desired semantics $d$ of the other child of the node $b$, given the target
semantics $t$ and the forward-propagated semantics $s(u \circ p_1)$.

Fig. 2. An operation performed by Forward Propagation Mutation.

Since in this case the desired semantics is usually unambiguous, we can use a different method of searching the library, which could not be easily applied within the RDO operator. Here, we search for the library procedure that achieves the highest cosine similarity. In other words, if we treat semantics as an n-dimensional vector, we return the library procedure $p_l^*$ that makes the smallest angle with the desired semantics d, i.e.:

$$p_l^* = \arg\min_{p_l \in L} \arccos \frac{s(p_l) \cdot d}{\|s(p_l)\| \, \|d\|}$$

Finally, we add a constant node c to scale the semantics of the library procedure, making it closer to the desired semantics, i.e., $c = (s(p_l^*) \cdot d) / \|s(p_l^*)\|^2$. An alternative, more computationally expensive approach would be to run simple linear regression for each candidate program in the library, using its semantics as a single explanatory variable and the desired semantics d as a response. This approach would require extending the context structure to accommodate both an intercept and a slope coefficient.
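As a brief check of the scaling formula, the short derivation below shows that c is the least-squares scale factor. It is a sketch in which s is shorthand (introduced here) for the semantics of the selected library procedure, assumed to be a nonzero vector.

```latex
\begin{aligned}
E(c) &= \| c\,s - d \|^2 = c^2 \|s\|^2 - 2c\,(s \cdot d) + \|d\|^2, \\
\frac{dE}{dc} &= 2c\,\|s\|^2 - 2\,(s \cdot d) = 0
\quad\Longrightarrow\quad
c = \frac{s \cdot d}{\|s\|^2}.
\end{aligned}
```

This is the slope of a regression through the origin, which is consistent with the remark that fitting an intercept as well would require additional nodes in the context.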

5 Experimental setup

The main goal of the experiments is to compare the performance of the RDO and FPM mutation operators on a suite of symbolic regression benchmarks. Additionally, as a control setup we employ traditional subtree-replacing mutation (SRM). All three mutation operators are used along with conventional subtree-swapping crossover in a standard generational tree-based GP algorithm with tournament selection. Each mutation operator is employed in five setups with different values of mutation and crossover probabilities (the source code of our experiments is available at https://github.com/mszubert/ppsn_2016).

Most of the GP parameters (summarized in Table 1) are adopted from the recent work on semantic backpropagation [12]. In particular, whenever a random mutation/crossover node needs to be selected, a uniform depth node selector is used. Given a program p, it first calculates the program's height h, then draws an integer d uniformly from the interval [0, h], and finally selects a random node from all nodes at depth d in program p. This technique has recently been shown to reduce bloat when compared to conventional Koza-I node selectors [6,12].
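The three steps of the uniform depth node selector can be sketched as below. The tree encoding (nested tuples of the form `(label, child, ...)`) and the function names are assumptions chosen for this illustration, not the representation used in the authors' code.

```python
import random


def height(tree):
    """Height of a tree given as (label, child, ...); a leaf has height 0."""
    _, *children = tree
    return 1 + max(map(height, children)) if children else 0


def nodes_at_depth(tree, depth):
    """All subtrees whose root lies exactly `depth` edges below the root."""
    if depth == 0:
        return [tree]
    _, *children = tree
    return [n for child in children for n in nodes_at_depth(child, depth - 1)]


def uniform_depth_select(tree, rng=random):
    """Draw a depth uniformly from [0, h], then a node uniformly at that depth."""
    h = height(tree)
    d = rng.randint(0, h)  # randint is inclusive on both ends
    return rng.choice(nodes_at_depth(tree, d))


# x + x * sin(y), written as nested tuples.
tree = ("+", ("x",), ("*", ("x",), ("sin", ("y",))))
picked = uniform_depth_select(tree, random.Random(0))
```

Compared with picking uniformly among all nodes, this scheme gives shallow nodes a higher chance of selection, because every depth level is equally likely regardless of how many nodes it contains.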

Moreover, both RDO and FPM use a population-based library, which is constructed at each generation from all semantically unique subtrees (subprograms) in the current population. Since we impose an upper limit on the tree height (17), when searching the library we ignore all the procedures that would violate this constraint when inserted into the parent program.
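One way such a library could be built is sketched below. It is a simplified illustration under stated assumptions: trees are nested tuples, only three functions are supported, semantics are rounded to 9 decimals before comparison, the first subtree seen for each semantics is kept, and the height check uses only the insertion depth. None of these details are taken from the original experiments.

```python
import math

FUNCS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b, "sin": math.sin}


def evaluate(tree, x):
    """Evaluate a tree given as (label, child, ...) at input x."""
    label, *children = tree
    if not children:
        return x if label == "x" else float(label)
    return FUNCS[label](*(evaluate(c, x) for c in children))


def subtrees(tree):
    """Yield the tree itself and every subtree below it."""
    yield tree
    for child in tree[1:]:
        yield from subtrees(child)


def height(tree):
    return 1 + max(height(c) for c in tree[1:]) if len(tree) > 1 else 0


def build_library(population, cases):
    """Map each distinct semantics vector to one subtree that produces it."""
    library = {}
    for program in population:
        for sub in subtrees(program):
            sem = tuple(round(evaluate(sub, x), 9) for x in cases)
            library.setdefault(sem, sub)  # keep the first subtree seen
    return library


def admissible(library, insertion_depth, max_height=17):
    """Drop procedures that would push the offspring above max_height."""
    return {sem: p for sem, p in library.items()
            if insertion_depth + height(p) <= max_height}


population = [("+", ("x",), ("x",)), ("*", ("2",), ("x",))]
library = build_library(population, cases=[-1.0, 0.0, 1.0])
```

In this example `x + x` and `2 * x` share the same semantics, so only one of them enters the library, alongside `x` and the constant `2`.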

We investigate training error, generalization performance (error on 1 000 unseen test cases) and the size of programs produced by particular mutation operators on 11 symbolic regression benchmarks. We consider six univariate and five bivariate problems that are adopted from previous studies [4,8,12]. Selected benchmarks (see Table 2) include polynomial, rational and trigonometric functions. For each problem, fitness was calculated as root-mean-square error on a number of training cases. The univariate problems use 20 cases distributed equidistantly in the [−1, 1] range, while the bivariate ones use a grid of 10 × 10 = 100 points spaced evenly in the [−1, 1] × [−1, 1] square.
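The training sets and the fitness measure described above might be generated as in the following sketch. It assumes that both interval endpoints are included in the equidistant points, which the text does not state explicitly; the helper names `linspace`, `rmse`, `quartic` and `keijzer13` are introduced here for illustration, with the two objective functions taken from the P4 and K13 rows of Table 2.

```python
import math


def linspace(lo, hi, n):
    """n equally spaced points from lo to hi (endpoints included, an assumption)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def rmse(program, objective, cases):
    """Root-mean-square error of `program` against `objective` on `cases`."""
    squared = sum((program(*c) - objective(*c)) ** 2 for c in cases)
    return math.sqrt(squared / len(cases))


def quartic(x):  # benchmark P4
    return x**4 + x**3 + x**2 + x


def keijzer13(x, y):  # benchmark K13
    return 6 * math.sin(x) * math.cos(y)


# 20 equidistant univariate cases and a 10 x 10 bivariate grid.
univariate_cases = [(x,) for x in linspace(-1.0, 1.0, 20)]
grid = linspace(-1.0, 1.0, 10)
bivariate_cases = [(x, y) for x in grid for y in grid]

# Fitness of a candidate that misses the linear term of P4.
candidate_error = rmse(lambda x: x**4 + x**3 + x**2, quartic, univariate_cases)
```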


Table 1. Genetic programming parameters

Parameter            Value
population size      256
generations          100
initialization       ramped half-and-half with height range 2-6,
                     100 retries until accepting a syntactic duplicate
instruction set      {+, −, ×, /, exp, log, sin, cos} (log and / are protected)
tournament size      7
fitness function     root-mean-square error (RMSE)
node selection       uniform depth node selector
maximum tree height  17
number of runs       30

Table 2. Symbolic regression benchmarks.

Benchmark name    Objective function                               Variables  Training cases
P4 (Quartic)      $x^4 + x^3 + x^2 + x$                            1          20
P7 (Septic)       $x^7 - 2x^6 + x^5 - x^4 + x^3 - 2x^2 + x$        1          20
P9 (Nonic)        $\sum_{i=1}^{9} x^i$                             1          20
R1                $(x + 1)^3 / (x^2 - x + 1)$                      1          20
R2                $(x^5 - 3x^3 + 1) / (x^2 + 1)$                   1          20
R3                $(x^6 + x^5) / (x^4 + x^3 + x^2 + x + 1)$        1          20
K11 (Keijzer-11)  $xy + \sin((x - 1)(y - 1))$                      2          100
K12 (Keijzer-12)  $x^4 - x^3 + y^2/2 - y$                          2          100
K13 (Keijzer-13)  $6 \sin(x) \cos(y)$                              2          100
K14 (Keijzer-14)  $8 / (2 + x^2 + y^2)$                            2          100
K15 (Keijzer-15)  $x^3/5 + y^3/2 - y - x$                          2          100

6  Results  and  Discussion 

Table 3 presents detailed characteristics of the best-of-run individuals evolved with particular mutation operators. Each row of the table corresponds to a single combination of one of the five GP setups (with different crossover (X) and mutation (M) probabilities) and one of the three considered mutation operators (FPM, RDO or SRM). We performed 30 independent GP runs for each of these 15 combinations on each of the 11 symbolic regression problems. To confirm statistically significant differences between the results obtained with particular mutation operators, for each problem and parameter setup we conducted the Kruskal-Wallis test followed by a post-hoc analysis using pairwise Mann-Whitney tests (with sequential Bonferroni correction). We set the level of significance at p < 0.05. Table 3 shows with an underline the results that were found significantly better than those achieved with the other operators.
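The correction step of this post-hoc analysis can be sketched as follows, assuming that "sequential Bonferroni" refers to Holm's step-down procedure. The function name `holm_bonferroni` and the p-values in the example are hypothetical; in practice the p-values would come from the three pairwise Mann-Whitney tests (FPM vs. RDO, FPM vs. SRM, RDO vs. SRM).

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm's step-down procedure.

    Returns a list of booleans, in the original order of `p_values`,
    that is True where the corresponding null hypothesis is rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # all remaining (larger) p-values are not rejected either
    return reject


# Hypothetical p-values for three pairwise comparisons.
all_significant = holm_bonferroni([0.030, 0.001, 0.020])
partly_significant = holm_bonferroni([0.040, 0.001, 0.030])
```

The second example shows the sequential nature of the procedure: once 0.030 fails its threshold of 0.025, the larger value 0.040 is not tested further, even though it is below 0.05.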

The first part of Table 3 shows the average training errors. Although RDO achieves the best overall results for most univariate problems, for the bivariate ones FPM produces more competitive results. Regardless of the parameter settings, the traditional SRM operator leads to the highest training error. Notably, the RDO and FPM operators obtain their best results under different crossover and mutation settings. While both of them benefit from using traditional crossover as an additional variation operator, the performance of FPM decreases when mutation is performed too frequently (i.e., if M = 1.0). To explain this phenomenon, let us note that for a given subprogram, the FPM operator builds a context in a deterministic way. As a result, if two semantically equivalent subprograms are selected in the same generation, they will result in identical offspring. Consequently, FPM can lead to creating too many duplicated programs and thus losing diversity in the population. Importantly, although RDO is also deterministic, it is less susceptible to this problem because typically the number of distinct contexts is much larger than that of distinct subtrees.

In order to assess the generalization performance of evolved programs, we calculate the root-mean-square error on 1 000 test cases drawn uniformly from the same range as for the training cases. The median test errors committed by the best-of-run individuals are presented in the second part of Table 3. In most cases, the RDO operator (especially for setups that achieve the lowest training error) suffers from substantial overfitting, resulting in large test error. Although the FPM operator is also vulnerable to overfitting (in particular on problem P9), it is not as severe as in the case of RDO. With a few exceptions, for each of the considered problems and parameter setups, the FPM operator obtains the highest generalization performance.

Finally, we investigate the average size of best-of-run individuals, which is presented in the last part of Table 3. Not surprisingly, RDO is the most bloating operator, and this is one of the reasons for its poor performance on the unseen test data. On the other hand, in preliminary experiments with an imposed program size limit of 300 nodes, we also observed overfitting of the RDO operator. The programs produced by FPM tend to be much smaller. In particular, on two relatively simple problems, P4 and K13, the FPM operator finds short programs that obtain zero test error. Apparently, employing FPM makes it possible to discover solutions that are very close to the original function underlying the training data. However, on all the other problems, the programs produced by RDO and FPM are significantly larger than those created by the traditional SRM operator.

7  Conclusions 

Semantic GP operators have proved to be effective on a number of symbolic regression problems [14,13,11]. In this study, we confirmed these observations by analyzing the performance of the RDO operator based on semantic backpropagation [12] and the FPM operator that employs a novel idea of semantic forward propagation. When applied to a suite of symbolic regression benchmarks, both operators significantly outperformed the subtree-replacing mutation operator conventionally applied in GP. However, while both considered semantic operators achieved competitive performance on the training data, the RDO operator was found to be much more susceptible to overfitting. The proposed FPM operator, on the other hand, consistently produced shorter programs that obtained significantly lower error on the unseen test data.

Despite achieving superior predictive accuracy and producing shorter programs than RDO, the programs constructed with the FPM operator are still too large to be easily understood. This is unfortunate, since finding comprehensible solutions has always been considered one of the primary benefits of using GP instead of black-box machine learning methods. As most semantic-aware operators tend to produce large or very large programs [10], the problem of bloat remains the major challenge that can limit the practical applicability of such methods. Therefore, one of the most important directions of future work is to investigate the performance of the RDO and FPM operators combined with parsimony pressure mechanisms that control the complexity of evolved programs.

Table 3. Detailed characteristics of best-of-run individuals produced by particular mutation operators (FPM, RDO, SRM), aggregated over 30 GP runs. Each operator was employed in 5 GP setups with different crossover (X) and mutation (M) probabilities. Bold marks the best results achieved under certain X/M settings on particular problems. Underline indicates statistically significant superiority.


Average training error

Op.  X    M    P4      P7      P9      R1      R2      R3      K11     K12     K13     K14     K15
FPM  0.0  1.0  0.0011  0.0072  0.0153  0.0064  0.0049  0.0024  0.0626  0.0418  0.0000  0.0031  0.0061
FPM  0.5  0.5  0.0001  0.0018  0.0025  0.0012  0.0018  0.0006  0.0299  0.0111  0.0000  0.0012  0.0007
FPM  0.5  1.0  0.0001  0.0025  0.0037  0.0020  0.0025  0.0007  0.0358  0.0154  0.0000  0.0021  0.0007
FPM  1.0  0.5  0.0000  0.0015  0.0022  0.0012  0.0013  0.0004  0.0283  0.0086  0.0000  0.0012  0.0004
FPM  1.0  1.0  0.0001  0.0026  0.0029  0.0018  0.0019  0.0006  0.0334  0.0116  0.0000  0.0017  0.0007
RDO  0.0  1.0  0.0030  0.0034  0.0147  0.0071  0.0043  0.0030  0.0709  0.0444  0.0440  0.0587  0.0302
RDO  0.5  0.5  0.0004  0.0017  0.0029  0.0023  0.0014  0.0018  0.0455  0.0090  0.0038  0.0132  0.0024
RDO  0.5  1.0  0.0001  0.0008  0.0004  0.0006  0.0004  0.0002  0.0294  0.0029  0.0007  0.0044  0.0015
RDO  1.0  0.5  0.0003  0.0020  0.0008  0.0014  0.0015  0.0004  0.0504  0.0063  0.0015  0.0087  0.0041
RDO  1.0  1.0  0.0001  0.0003  0.0003  0.0008  0.0004  0.0004  0.0294  0.0047  0.0011  0.0033  0.0008
SRM  0.0  1.0  0.0518  0.0742  0.0758  0.0744  0.0811  0.0097  0.2025  0.3049  0.1552  0.2145  0.0723
SRM  0.5  0.5  0.0323  0.0968  0.0732  0.0834  0.0608  0.0156  0.1769  0.2328  0.1040  0.1138  0.0608
SRM  0.5  1.0  0.0449  0.0926  0.0638  0.0792  0.0880  0.0115  0.1781  0.2128  0.1267  0.1603  0.0748
SRM  1.0  0.5  0.0217  0.0882  0.0715  0.0663  0.0666  0.0078  0.1598  0.2005  0.0866  0.1690  0.0623
SRM  1.0  1.0  0.0282  0.0845  0.0792  0.0698  0.0754  0.0120  0.1942  0.2479  0.1437  0.1724  0.0628

Median test error

Op.  X    M    P4      P7      P9      R1      R2      R3      K11     K12     K13     K14     K15
FPM  0.0  1.0  0.0009  0.0084  0.0342  0.0071  0.0044  0.0026  0.0555  0.0529  0.0000  0.0028  0.0057
FPM  0.5  0.5  0.0000  0.0046  0.0256  0.0025  0.0123  0.0030  0.0425  0.0581  0.0000  0.0017  0.0008
FPM  0.5  1.0  0.0000  0.0045  0.0142  0.0037  0.0045  0.0015  0.0333  0.0290  0.0000  0.0032  0.0008
FPM  1.0  0.5  0.0000  0.0069  0.0306  0.0030  0.0042  0.0017  0.0260  0.0311  0.0000  0.0021  0.0005
FPM  1.0  1.0  0.0000  0.0055  0.0227  0.0025  0.0027  0.0009  0.0300  0.0295  0.0000  0.0024  0.0008
RDO  0.0  1.0  0.0039  0.0593  0.0346  0.0087  0.0145  0.0071  0.1089  0.0988  0.0215  0.0774  0.0185
RDO  0.5  0.5  0.0025  0.5159  0.0469  0.0406  0.0148  0.0028  0.0738  0.0374  0.0070  0.0252  0.0036
RDO  0.5  1.0  0.0117  0.3084  0.0715  0.1522  0.0652  0.0618  0.0639  0.2124  0.0022  0.0556  0.0097
RDO  1.0  0.5  0.0006  0.0704  0.0104  0.0081  0.0319  0.0057  0.0445  0.0364  0.0030  0.0283  0.0014
RDO  1.0  1.0  0.0097  19.486  8E+3    0.0607  0.0466  0.0155  0.0402  0.3878  0.0023  0.0378  0.0026
SRM  0.0  1.0  0.0485  0.1170  0.1017  0.0836  0.0585  0.0123  0.2005  0.2649  0.1986  0.1458  0.0814
SRM  0.5  0.5  0.0240  0.0958  0.0810  0.0730  0.0592  0.0106  0.1770  0.1874  0.1122  0.0988  0.0525
SRM  0.5  1.0  0.0572  0.1865  0.0922  0.0800  0.0694  0.0105  0.1686  0.2037  0.1311  0.1207  0.0882
SRM  1.0  0.5  0.0191  0.0899  0.0785  0.0711  0.0641  0.0101  0.1493  0.1874  0.0998  0.1126  0.0446
SRM  1.0  1.0  0.0257  0.0734  0.0725  0.0739  0.0727  0.0142  0.1894  0.1853  0.1843  0.1723  0.0389

Average program size

Op.  X    M    P4      P7      P9      R1      R2      R3      K11     K12     K13     K14     K15
FPM  0.0  1.0  172.6   179.0   195.9   162.2   187.9   161.0   210.9   172.4   9.1     204.9   207.7
FPM  0.5  0.5  150.4   341.4   322.1   325.3   352.8   347.7   328.3   305.5   7.4     326.5   260.6
FPM  0.5  1.0  78.3    292.4   271.7   287.9   283.3   265.9   286.4   264.5   8.5     258.9   239.6
FPM  1.0  0.5  44.0    346.6   354.4   327.2   339.2   311.0   328.0   311.1   7.8     298.6   300.0
FPM  1.0  1.0  99.2    283.4   271.0   255.7   253.3   270.6   244.8   230.6   8.9     239.8   264.4
RDO  0.0  1.0  537.6   690.6   550.8   777.5   2656.9  1203.7  418.6   434.8   85.0    147.2   250.2
RDO  0.5  0.5  503.6   637.9   686.0   493.9   529.6   485.7   358.4   482.4   497.1   346.4   1299.6
RDO  0.5  1.0  626.9   1004.3  934.1   906.7   854.0   747.2   654.2   841.2   464.3   548.6   1137.2
RDO  1.0  0.5  378.6   631.2   588.4   473.0   508.9   486.7   316.8   472.9   311.0   325.5   673.8
RDO  1.0  1.0  645.6   903.6   909.9   668.6   746.4   696.2   542.9   838.7   426.7   514.6   1034.4
SRM  0.0  1.0  122.9   176.1   152.4   133.8   116.2   155.9   109.3   95.1    95.7    63.0    74.7
SRM  0.5  0.5  60.0    109.4   95.4    79.7    76.1    95.8    62.7    69.4    59.6    53.5    57.5
SRM  0.5  1.0  111.8   172.8   159.6   154.3   122.3   173.6   99.2    95.3    96.1    79.1    82.0
SRM  1.0  0.5  97.9    106.4   107.1   96.8    89.9    137.2   89.5    86.6    81.1    87.6    64.4
SRM  1.0  1.0  119.0   160.9   147.5   150.6   131.0   165.7   95.3    83.2    95.9    80.4    96.5